10/603,108 






REMARKS 

By the present amendment, the paragraph of the Specification appearing at page
26, line 26 through page 27, line 3, is amended to correct the spelling of "Altschul" and "ORFs".
The paragraph of the Specification appearing at page 45, lines 1-11, is amended to capitalize the
trademark "GENBANK".

Claims 5 and 9 are amended by the present amendment. Claim 5 has been amended to
remove "a complement of SEQ ID NO: 1298". Claim 9 has been amended to delete the term
"eight" as it relates to the length of the claimed probe and replace it with "forty". Support for 
this amendment may be found, for example, on page 10, line 29 of the Specification. Claims 10- 
31 are cancelled to expedite prosecution, and not to concede to the Office's rejections. 

No prohibited new matter has been introduced by way of the above amendments. 
Applicants reserve the right to file a continuation or divisional application on any subject matter 
cancelled by way of this Amendment. Applicants respectfully request consideration of the 
subject application as amended herein. 

Objections to the Specification 

The Specification stands objected to by the Office because the Title of the invention is not 
descriptive. Specifically, the Office alleges that the Title of the invention is not descriptive since 
the elected claims are directed to polynucleotides. Applicants respectfully submit that Claim 1 is 
directed to an isolated nucleic acid sequence encoding a polypeptide of SEQ ID NO:3218. 
Accordingly, Claim 1 includes a polypeptide encoded by a nucleic acid sequence of Applicants'
invention and the Title of the invention, which recites "Nucleic Acid and Amino Acid
Sequences," is descriptive. Reconsideration and withdrawal of the objection are respectfully
requested.

The Specification is also objected to because the trademark GENBANK, appearing in the
Specification, is not capitalized and the terms "andORF's" and "Altshal" appear to be misspelled.
By way of the present amendments, Applicants have corrected such terms and respectfully 
request reconsideration and withdrawal of the present objections to the Specification. 






Rejections under 35 U.S.C. § 101

Claims 1-10 stand rejected under 35 U.S.C. § 101 as allegedly not supported by a 
specific, substantial and credible utility. Applicants note that Claim 10 has been cancelled by 
way of the present amendment, thereby rendering any rejection of that claim moot. The 
rejections of Claims 1-9 are respectfully traversed. 

The Manual of Patent Examining Procedure (MPEP) states at § 2107.01, that research 
tools can be "useful" in a patent sense: 

Many research tools such as ... nucleotide sequencing techniques 
have a clear, specific and unquestionable utility (e.g., they are 
useful in analyzing compounds). An assessment that focuses on 
whether an invention is useful only in a research setting thus does 
not address whether the invention is in fact "useful" in a patent 
sense. Instead, Office personnel must distinguish between 
inventions that have a specifically identified substantial utility and 
inventions whose asserted utility requires further research to 
identify or reasonably confirm. 

Therefore, nucleotide sequencing techniques, which can include microbial genomic 
databases containing nucleic acid sequences, amino acid sequences and sequence homology 
information of bacterial genes that are, in turn, useful in the functional analysis of the bacterial 
genome, can meet the utility requirement of 35 U.S.C. § 101 if, for example, the nucleic acid 
sequences and proteins encoded by the nucleic acid sequences have specific, substantial and 
credible utility, such as in the development of antibiotics, diagnostics, vaccines and drugs to treat 
humans afflicted with infection caused by the bacteria. 

Further, the MPEP states, at § 2107.02B, that the utility requirement of 35 U.S.C. § 101 is met, even if
a specific, substantial and credible utility for the claimed invention is not asserted in the 
specification, if such utility is well-established: 

An invention has a well-established utility if (i) a person of 
ordinary skill in the art would immediately appreciate why the 
invention is useful based on the characteristics of the invention 
(e.g., properties or applications of a product or process), and (ii) 
the utility is specific, substantial, and credible. If an invention has
a well-established utility, rejections under 35 U.S.C. § 101 and 35
U.S.C. § 112, first paragraph, based on lack of utility should not be
imposed. (citations omitted).

In addition, the guidelines for examination of patent applications under the 35 U.S.C. § 101
"utility" requirement, as published in the Federal Register, Vol. 66, No. 4, 1092-1099, at 1097,
provide that:

Only one specific, substantial and credible utility is required to 
satisfy the statutory requirement. Where one or more well- 
established utilities would have been readily apparent to those of 
skill in the art at the time of the invention, an applicant may rely on 
any one of those utilities without prejudice. (emphasis added)

Applicants respectfully submit that one or more well-established utilities would be readily 
apparent to one of skill in the art. In particular, one of skill in the art would recognize and 
appreciate the utility of the claimed invention for the purposes of developing new drug targets, 
diagnostics and therapeutics. The Specification, for example at page 2, lines 9-26, provides that
M. catarrhalis is the most important lower respiratory tract pathogen after S. pneumoniae and H.
influenzae (Doern, G., et al., Diagn. Microbiol. Infect. Dis. 4:191-201 (1986)), and in some
hospitals M. catarrhalis accounts for half of all respiratory infections (Bluestone, C., et al.,
Pediatr. Infect. Dis. J. 11:S7-S11 (1992)). Further complicating the treatment of infections
caused by M. catarrhalis, as stated on page 2, line 27 through page 3, line 4 of the Specification,
M. catarrhalis has become resistant to many antibiotics, including vancomycin (Doern, G., et al.,
Antimicrob. Agents Chemother. 40:2884-2886 (1996); Wallace, RJ. Am. J. Med. 88:46S-50S
(1990); Hoppe, HL. Am. J. Health Syst. Pharm. 55:1881-97 (1998)).

The utility of identifying essential genes for M. catarrhalis, which can be employed, for 
example, to develop new drug targets, diagnostics and therapeutics for the antibiotic resistant M. 
catarrhalis is also discussed, for example, on page 3, lines 12-23 of the Specification: 

The present invention fulfills the need for diagnostic tools and therapeutics 
by providing bacterial-specific compositions and methods for detecting 
Moraxella species including M. catarrhalis, as well as compositions and
methods useful for treating and preventing Moraxella infection, in
particular, M. catarrhalis infection, in vertebrates including mammals.

The present invention encompasses isolated nucleic acids and polypeptides
derived from M. catarrhalis that are useful as reagents for diagnosis of
bacterial disease, components of effective antibacterial vaccines, and/or as
targets for antibacterial drugs including anti-M. catarrhalis drugs. They
can also be used to detect the presence of M. catarrhalis and other
Moraxella species in a sample; and in screening compounds for the ability
to interfere with the M. catarrhalis life cycle or to inhibit M. catarrhalis
infection.

The usefulness of identifying genes for new therapeutics and diagnostics would be readily
apparent to one of skill in the art at the time of Applicants' invention. Specifically, the utility of
genome sequencing information from microbial pathogens, in particular antibiotic resistant
bacterial pathogens, is well-established in the art. For example, Moir, et al., Antimicrob. Agents
Chemother. 43:439-446 (1999), a copy of which is attached hereto as Exhibit 1, provides that
genomic sequence information has provided a wealth of information to assist in the development
of strategies for antimicrobial drug discovery, particularly in antibiotic-resistant bacteria.
Specifically, on page 439 Moir, et al. provides:

Thus, there is little doubt that new antibiotics are needed to combat 
the growing problem of antibiotic-resistant bacteria, and targeting 
of new pathways will likely play an important role in discovery of 
these new antibiotics. In fact, a number of crucial cellular 
pathways, such as secretion, cell division and many metabolic 
functions remain untargeted today. In the last 3 years, high- 
throughput automated random genomic DNA sequencing together 
with robust fragment assembly tools has delivered a wealth of 
genomic sequence information to assist in the search for new 
targets. In many cases, entire biochemical pathways can be 
reconstructed and compared in different pathogens. 

In addition, Tatusov, et al., Science 278:631-637 (1997), a copy of which is attached
hereto as Exhibit 2, provides on page 631 that comparisons of complete genomic sequences of
bacteria are useful and can be critically important to the development of targets for new 
antibiotics: 






With multiple genome sequences, it is possible to delineate protein 
families that are highly conserved in one domain of life but are 
missing in the others. Such information may be critically 
important: For example, the families that are conserved among 
bacteria but are missing in eukaryotes comprise the pool of 
potential targets for broad-spectrum antibiotics. 

Smith DR, TIBTECH 14:290-293 (1996), a copy of which is attached hereto as Exhibit 3,
provides that microbial genome sequence information is useful in new strategies for identifying 
therapeutics and vaccine development. Specifically, on page 293 Smith provides: 



The techniques described in the previous section can be used to 
identify genes in specific functional categories that may represent 
good targets for drug or vaccine development. In general, when 
developing new antibiotics, one is interested in genes that are 
essential under all growth conditions (and preferably even in 
quiescent cells), and for which inhibitors with useful chemical 
properties, such as permeability and low toxicity, can be identified. 
One advantage of having the entire sequence of a genome is that 
targets can be prioritized in terms of their activities and the 
properties of compounds that are known to interact with them. 

In addition, Smith states, on pages 291-292, that the first task in identifying new 
strategies for therapeutics and vaccine targets is to identify genes of the microbial organism and 
that the second task is identifying sequence homology which is useful in the analysis of gene 
products. Specifically, on page 292 Smith provides: 



The second phase in the analysis of bacterial genomes is to identify 
the function of as many genes as possible. Currently, sequence 
homology is the most powerful tool. A high degree of homology 
between the putative translation product of a newly identified gene 
and an enzyme whose function has been thoroughly studied in 
other organisms, provides strong support for the function of that 
protein. 

Applicants respectfully submit that the usefulness of identifying genes for new 
therapeutics and diagnostics would be readily apparent to one of skill in the art at the time of 
Applicants' invention.






In addition to having identified the claimed nucleic acid of M. catarrhalis, Applicants
have also provided the identified nucleic acid sequence homology, thereby providing strong
support for the function of the claimed protein. Specifically, Table 2 of the Specification
provides that the amino acid sequence SEQ ID NO:3218, encoded by nucleic acid sequence SEQ
ID NO:1298, has homology with H. influenzae acetylglucosamine-1-phosphate
uridyltransferase, an essential protein in the synthetic pathway of H. influenzae. Furthermore,
Table 2 provides the score and probability (determined by the BLASTP2 algorithm) of the
homology match between the amino acid sequence (SEQ ID NO:3218) encoded by SEQ ID
NO:1298 of Applicants' claimed invention and the amino acid sequence of H. influenzae
acetylglucosamine-1-phosphate uridyltransferase.

As stated in the Federal Register at Vol. 66, No. 4, at page 1096: 

More specifically, when a patent application claiming a nucleic 
acid asserts a specific, substantial, and credible utility, and bases 
the assertion upon homology to existing nucleic acids or proteins 
having an accepted utility, the asserted utility must be accepted by 
the examiner unless the Office has sufficient evidence or sound 
scientific reasoning to rebut such an assertion. "[A] 'rigorous 
correlation' need not be shown in order to establish practical 
utility; 'reasonable correlation' is sufficient." (citations omitted). 

Thus, nucleic acid sequences and their encoded amino acid sequences, which are 
homologous to known sequences with accepted utility, can meet the utility requirement of 35 
U.S.C. § 101, if, for example, the homologous nucleic acid and amino acid sequences have 
accepted utility and the nucleic acid and amino acid sequences of the invention assert a specific,
substantial and credible utility, such as the function of the homologous protein. 

As shown in Table 2 of the Specification, Applicants' claimed invention, amino acid
sequence SEQ ID NO:3218, encoded by nucleic acid sequence SEQ ID NO:1298, has
homology with H. influenzae acetylglucosamine-1-phosphate uridyltransferase, an essential
protein in the synthetic pathway of H. influenzae.






In addition to the assertions in Table 2, assertions of utility to homologous sequences can 
be found in the Specification, for example, page 2, line 9 through page 3, line 9; page 39, lines 
28-29; and page 44, lines 24-25. A description of Table 2 and the well-known sequence
comparison software programs, which were employed to identify the homologous sequences, can also be
found in the Specification, for example, on page 6, lines 24-29.

Applicants' claimed invention provides nucleic acid sequences which encode
polypeptides for use in new strategies for diagnostics and therapeutics. Specifically, Applicants'
claimed invention, which is part of a microbial genomic database of sequences generally referred
to by Moir, et al., Tatusov, et al., and Smith, includes a wide variety of nucleic acid sequences
which encode proteins that share homology with known proteins that have utility, several of
which have been shown to be essential to the life of bacteria. In particular, the claimed subject
matter of the instant application has homology with a protein involved in an essential synthetic
pathway in H. influenzae.

In addition to statements made by the Applicants with respect to utility of the claimed 
invention, for example, page 3, lines 12-23 and page 4, lines 2-21 of the Specification, 
substantial utility is well-established in view of the statements of Moir, et al, Tatusov, et al, and 
Smith that microbial genomic databases, containing nucleic acid and amino acid sequences, are 
useful in diagnostics, the development of new vaccines and in the search for antimicrobial drug 
discovery. The utility is well-established and credible, under 35 U.S.C. § 101, when assessed 
from the perspective of one of ordinary skill in the art in view of the disclosure and the 
statements of Moir, et al, Tatusov, et al, and Smith, that microbial genomic databases, having 
nucleic acid and amino acid sequences, have afforded "new tools to take advantage of genomic 
sequence information in the drug discovery process". (Moir, et al, at 439). 

The Office has not provided any evidence to rebut the assertion that the claimed nucleic
acid sequence has the requisite utility, at least as a target for drug development and
development of diagnostic tools. Furthermore, the Office has not provided any evidence 
suggesting that one or more well-established utilities would not have been readily apparent to one 
of skill in the art at the time of the invention. Therefore, Applicants respectfully submit that the 
claimed invention meets the requirements of 35 U.S.C. § 101 and withdrawal of the rejection is 
respectfully requested. 






Rejections under 35 U.S.C. § 112, first paragraph

The Office has rejected Claims 1-10 under 35 U.S.C. § 112, first paragraph, as allegedly
". . .not being supported by a specific, substantial, and credible utility. . .one skilled in the art
clearly would not know how to use the claimed invention." The Office has also rejected Claims
8 and 10 under 35 U.S.C. § 112, first paragraph, as allegedly containing subject matter which
was not described in the Specification in such a way as to reasonably convey to one of skill in the 
art that Applicants had possession of the claimed invention at the time of filing. Applicants note 
that in order to expedite prosecution, and not to concede to the Office's rejection, Applicants 
have cancelled Claim 10, thus rendering any rejection of that claim moot. The rejections of 
Claims 1-9 are respectfully traversed. 

The Office should "not impose a 35 U.S.C. § 112, first paragraph, rejection grounded on
a 'lack of utility' basis unless a 35 U.S.C. § 101 rejection is proper." MPEP § 2107 (IV) at 2100-
36. As discussed supra, Claims 1-9 comply with the utility requirement set forth in 35 U.S.C. § 
101. Accordingly, withdrawal of the rejection is respectfully requested. 

Claims 8 and 10 are also rejected under 35 U.S.C. § 112, first paragraph, as allegedly
containing subject matter not described in the specification in such a way as to reasonably convey 
to one of skill in the relevant art that Applicants had possession of the claimed invention. 
Applicants have cancelled Claim 10, thus rendering any rejection of that claim moot. 
Specifically, the Office alleges that Claim 8, which depends on Claim 5, is drawn to a method of 
producing an M. catarrhalis polypeptide encoded by a complement of SEQ ID NO: 1298. In 
order to expedite prosecution, and not to concede to the Office's rejection, Applicants have 
amended Claim 5, to remove the reference to "a complement" of SEQ ID NO: 1298, thus 
obviating the rejection of Claim 8. Applicants respectfully submit that the claimed invention 
meets the requirements of 35 U.S.C. § 112, first paragraph, and withdrawal of the rejection is
respectfully requested. 

Rejections under 35 U.S.C. § 102(b) 

Claims 9-10 stand rejected under 35 U.S.C. § 102(b) as allegedly being anticipated by
Wedler, et al. (Database sequence GenBank accession number Z72861, version Z72861.1,
8/11/1997). Specifically, the Office alleges that Wedler, et al. discloses a nucleic acid comprising
a sequence that has 21 contiguous nucleotides of the claimed invention. In order to expedite
prosecution, and not to concede to the Office's rejection, Claim 10 has been cancelled, thus
rendering any rejection of that claim moot. The rejection of Claim 9 is respectfully traversed.

Anticipation requires the disclosure in a single prior art reference of each element of the 
claim under consideration. W.L. Gore & Associates v. Garlock, Inc., 220 USPQ 303, 313 (Fed.
Cir. 1983), cert. denied, 469 U.S. 851 (1984); Connell v. Sears, Roebuck & Co., 220 USPQ 193,
198 (Fed. Cir. 1983); Verdegaal Bros. v. Union Oil Co. of California, 2 USPQ2d 1051, 1053 
(Fed. Cir. 1987); In re Spada, 15 USPQ2d 1655 (Fed. Cir. 1990); MPEP § 2131. 

In order to expedite prosecution, Applicants have amended Claim 9 to recite a probe 
comprising forty contiguous nucleotides of SEQ ID NO: 1298, thus obviating the rejection of 
Claim 9. Applicants respectfully submit that the claimed invention meets the requirements of 35 
U.S.C. § 102(b) and withdrawal of the rejection is respectfully requested. 



CONCLUSION

A general authorization is hereby granted to charge any fees or deficiencies to Deposit
Account No. 501040. In view of the amendments and remarks, it is believed that all claims are
in condition for allowance, and it is respectfully requested that the application be passed to issue.
If the Examiner feels that a telephone call would expedite the prosecution of this case, the
Examiner is invited to call the undersigned at (781) 398-2548.



By A / LH, 
RoberlG/spadafara/Bwf; 
Registration No. 4#fl97 
Telephone (781) 398-2300 
Facsimile (781) 398-2530




EUTICALS CORPORATION 



Waltham, Massachusetts
Dated: / J 




Antimicrobial Agents and Chemotherapy, Mar. 1999, p. 439-446 
0066-4804/99/$04.00+0

Copyright © 1999, American Society for Microbiology. All Rights Reserved.



MINIREVIEW



Genomics and Antimicrobial Drug Discovery 

DONALD T. MOIR, 1 KAREN J. SHAW, 2 ROBERTA S. HARE, 2 and GERALD F. VOVIS 1 * 
Pathogen Genetics Department, Genome Therapeutics Corporation, Waltham,
Massachusetts 02453-8443, 1 and Chemotherapy and Molecular Genetics,
Schering-Plough Research Institute, Kenilworth, New Jersey 07033-0539 2



INTRODUCTION 

The increasing frequency of nosocomial infections due to 
methicillin-resistant Staphylococcus aureus (MRSA) and vancomycin-resistant
Enterococcus faecium (VRE) and the fear
that high-level vancomycin resistance will eventually spread to 
staphylococci underscore the need for vigilance in the continu- 
ing war against pathogenic microbes (18, 39). Current widely 
used antibiotics are targeted at a surprisingly small number of 
vital cellular functions: cell wall, DNA, RNA, and protein bio- 
synthesis (Table 1), and instances of resistance to these anti- 
biotics are widespread and well documented (48). Thus, there 
is little doubt that new antibiotics are needed to combat the 
growing problem of antibiotic-resistant bacteria, and targeting 
of new pathways will likely play an important role in discovery 
of these new antibiotics. In fact, a number of crucial cellular 
pathways, such as secretion, cell division, and many metabolic 
functions, remain untargeted today. In the last 3 years, high- 
throughput automated random genomic DNA sequencing together
with robust fragment assembly tools has delivered a
wealth of genomic sequence information to assist in the search 
for new targets. In many cases, entire biochemical pathways 
can be reconstructed and compared in different pathogens. 
The purpose of this minireview is to indicate where this infor- 
mation can be found, to outline some of the ways in which it 
can be used, and to describe new tools to take advantage of 
genomic sequence information in the drug discovery process. 

Each potential new antibiotic must meet a number of criteria
before it is approved for use, and the choice of an appropriate
target is the first step in this process. It is helpful to
review the utility of genomic information with regard to some 
of the key criteria which antimicrobial targets must meet. In 
general, (i) a target should provide adequate selectivity and 
spectrum, yielding a drug which is specific or highly selective 
against the microbe with respect to the human host but also 
active against the desired spectrum of pathogens; (ii) a target 
should be essential for growth or viability of the pathogen, at 
least essential under conditions of infection; and (iii) some- 
thing about the function of the target should be known so that 
assays and high-throughput screens can be built. Identification 
of potential new targets can proceed from any one of these 
criteria, but ultimately all must be met by a successful target. 
For example, a variety of methods may be used to find genes 
which are essential for the survival of an organism under defined
conditions or which are necessary for infectivity in an
animal model. Comparative genomics may be used to identify 
potential targets which are shared across multiple microbial 

• Corresponding author. Mailing address: Genome Therapeutics 
Corporation, 100 Beaver St, Waltham, MA 02453-8443. Phone: (781) 
398-2313. Fax: (781) 398-2476. E-mail: jerry.vovis@genomecorp.com. 



species. Several tools, primarily sequence similarity based, may 
be used to predict the function of most genes so that specific 
pathways can be targeted. As discussed below, genomic se- 
quence information provides assistance in all of these areas: 
selectivity, spectrum, functionality, and essentiality (Fig. 1). 

CURRENT RESOURCES FOR GENOMIC SEQUENCE
AND FUNCTIONALITY INFORMATION

Numerous databases are now available which contain both
sequence and functionality information. Most of these are accessible
over the Internet through convenient Web browser
interfaces. Many also permit downloading of sequence information
for use on local servers. Sequence databases now contain
the nucleotide and predicted amino acid sequences of
virtually every gene in the model microbes Escherichia coli,
Bacillus subtilis, and Saccharomyces cerevisiae as well as in a
variety of other bacteria (Table 2; a version of this table is
updated regularly by The Institute for Genomic Research 
[TIGR] on their Web site: http://www.tigr.org/tdb/mdb/mdb 
.html). These databases are the result of extensive analysis 
of the genomic sequences of those organisms. Open reading 
frames have been analyzed by sequence comparison and by 
codon usage to identify those which are most likely to repre- 
sent transcribed genes. Putative functions have been assigned 
to slightly more than half of the genes in the model organisms 
based on sequence comparisons to genes of known function 
in other organisms, shared sequence motifs, or clustering of 
sequences into related families. Databases such as EcoCyc, 
KEGG, and WIT present these data in an organized and useful 
manner (see Table 3). 

Recently, some commercial databases have also become
available for nonexclusive use by commercial subscribers.
These databases generally also provide sequence information
not available in public databases and comparative software and
analysis tools for convenient analysis of the data. For example,
the results of prerun sequence similarity searches may be
stored to provide rapid answers to complex comparative genomic
queries by a subscriber. Finally, several Web-accessible
sites offer useful tools for sequence analysis via sequence similarity
searches, motif searches, and structural comparisons.
Examples of relevant Internet sites providing databases of sequence
and functionality information and research tools are
described in Table 3.

The next advance in microbial genomics will be the avail- 
ability of the complete genomic sequence from multiple strains 
of a single bacterial pathogen. The discovery of genes con- 
served in multiple pathogenic strains or the recognition of 
genes found only in the most virulent strains are examples of 
the power such genomic comparisons will provide. Sequence 
for a second strain of Helicobacter pylori has appeared and



Exhibit 1






TABLE 1. Gene targets of widely used antibiotics

Target category and gene product              Antibiotic class

Protein synthesis
  30S ribosomal subunit ..................... Aminoglycosides, tetracyclines
  50S ribosomal subunit ..................... Macrolides, chloramphenicol
  tRNA(Ile) synthetase ...................... Mupirocin
  Elongation factor G ....................... Fusidic acid

Nucleic acid synthesis
  DNA gyrase A subunit; topoisomerase IV .... Quinolones
  DNA gyrase B subunit ...................... Novobiocin
  RNA polymerase beta subunit ............... Rifampin
  DNA ....................................... Metronidazole

Cell wall peptidoglycan synthesis
  Transpeptidases ........................... Beta-lactams
  D-Ala-D-Ala ligase substrate .............. Glycopeptides

Antimetabolites
  Dihydrofolate reductase ................... Trimethoprim
  Dihydropteroate synthesis ................. Sulfonamides
  Fatty acid synthesis ...................... Isoniazid

sequence for a second strain of Mycobacterium tuberculosis will 
appear soon (Table 2). 

COMPARATIVE GENOMICS TO ASSESS THE 
SPECTRUM AND SELECTIVITY OF A TARGET 

One powerful use of genomic sequence information is to
compare all of the identified genes in different bacterial pathogens
to determine which genes are, or are not, shared by various
species. Indeed, Tatusov et al. (50) have suggested that
gene families conserved among bacteria but missing from eukaryotes
comprise a pool of potential targets for broad-spectrum
antibiotic development. An early step in this direction
was taken by Mushegian and Koonin (36), who identified
256 genes shared by the two completely sequenced bacterial
genomes at that time, those of Haemophilus influenzae and
Mycoplasma genitalium. On the other hand, genes which are
apparently unique to a species such as H. pylori might be ideal
for targeting that species with a narrow-spectrum antibiotic. As
the number of sequenced bacterial and fungal genomes grows,
so does the ability to find genes common to most microbial
pathogens or truly unique to a particular species. For example,
Arigoni et al. (6) identified 26 genes in E. coli, most of which
were conserved in the B. subtilis, M. genitalium, H. influenzae,
H. pylori, Streptococcus pneumoniae, and Borrelia burgdorferi
genomes. They reasoned that this list of genes, which had no
predictable function, contained novel targets for broad-spectrum
antibiotic development. These analyses can be extended
by including sequence comparisons to eukaryotic genomes as a
means to examine potential selectivity of a target (50). For
example, Arigoni et al. (6) reported that 15 of 26 proteins
broadly conserved across bacterial species also exhibited significant
sequence similarity to proteins in S. cerevisiae and,
therefore, represented targets which, in an assay, might identify
compounds that also have human toxicity. While these
targets could simply be avoided, it should be noted that the
targets of the majority of marketed antimicrobial agents show
some conservation with mammalian proteins.
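
The comparisons described in this section amount to set operations over per-genome gene family assignments. The following is a minimal sketch of that idea only; the function names, the family identifiers (famA, famB, ...) and the toy genome contents are hypothetical and are not taken from the cited studies.

```python
# Minimal sketch: candidate target selection as set operations.
# All family identifiers and genome contents below are hypothetical toy data.

def broad_spectrum_candidates(bacterial, eukaryotic):
    """Families found in every bacterial genome and in no eukaryotic genome."""
    shared = set.intersection(*bacterial.values())
    excluded = set().union(*eukaryotic.values())
    return shared - excluded

def narrow_spectrum_candidates(bacterial, species):
    """Families found in the named species and in no other bacterial genome."""
    others = set().union(
        *(families for name, families in bacterial.items() if name != species)
    )
    return bacterial[species] - others

# Hypothetical family assignments, keyed by organism name.
bacterial = {
    "H. influenzae": {"famA", "famB", "famC"},
    "M. genitalium": {"famA", "famB"},
    "H. pylori": {"famA", "famB", "famD"},
}
eukaryotic = {"S. cerevisiae": {"famB", "famE"}}

broad = broad_spectrum_candidates(bacterial, eukaryotic)
narrow = narrow_spectrum_candidates(bacterial, "H. pylori")
print(sorted(broad), sorted(narrow))
```

In this toy data, famB is conserved across the bacteria but also present in yeast, so it drops out of the broad-spectrum pool, which mirrors the selectivity concern raised for the 15 of 26 proteins with similarity to S. cerevisiae proteins.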

As in all sequence comparisons, the search parameters and
the quality of the input data, e.g., partial human or mammalian
sequence information, are critical. Relevant issues which must
be addressed include questions such as the following. What
degree of sequence similarity to another bacterial genome
indicates a shared gene? What degree of sequence similarity to
a mammalian gene warns of a possible toxicity problem? Since
sequence similarity-searching algorithms allow nearly complete
flexibility in the choice of these parameters, some known
examples are necessary to calibrate the method. Mushegian
and Koonin (36) used a BLASTP score of 90 as the cutoff for
defining a biologically relevant relationship between two protein
sequences. The appropriate cutoff score for exclusion of
genes with apparent mammalian homologs may be more gene
specific. Some examples reveal a general trend. Trimethoprim
is a highly selective inhibitor of bacterial dihydrofolate
reductase (DHFR) despite the fact that the human and E. coli
DHFR gene products share 28% amino acid identity over the
length of the two proteins (40). Similarly, the quinolones are
highly selective against bacterial gyrases despite the fact that
the N-terminal domain of human topoisomerase II shares 20%
amino acid identity with E. coli gyrase A (25). Fluconazoles are
highly selective for fungal lanosterol 14-α demethylases, even
though the human and yeast gene products share 37% amino
acid identity over their full length (5). These sequence identity
percentages translate into BLASTP scores of 132, 125, and
301, respectively, in a search of a large nonredundant protein
database comprised of sequences from GenBank, SwissProt,
and PIR. Therefore, exclusion of genes having apparent
mammalian homologs with scores >150 would likely be suitable for
a search of bacterial targets, but the score cutoff would have to
be raised to allow identification of the broadest set of antifungal
target genes.
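
The cutoff reasoning above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' procedure: the function name and the dictionary layout are hypothetical, and only the three scores (132, 125, and 301) and the cutoff of 150 are taken from the text.

```python
# Hypothetical sketch: keep a candidate target only if its best mammalian
# BLASTP hit does not exceed a chosen cutoff. The scores and the default
# cutoff of 150 are the values quoted in the text; all names are illustrative.

def passes_selectivity_filter(best_mammalian_score, cutoff=150):
    """Return True if the best mammalian hit scores at or below the cutoff."""
    return best_mammalian_score <= cutoff


# Best human-homolog BLASTP scores quoted in the text for three known targets.
known_targets = {
    "bacterial DHFR (trimethoprim)": 132,
    "bacterial gyrase A (quinolones)": 125,
    "fungal lanosterol 14-alpha demethylase (azoles)": 301,
}

retained = {
    name: passes_selectivity_filter(score)
    for name, score in known_targets.items()
}

for name, kept in retained.items():
    print(f"{name}: {'retained' if kept else 'excluded'}")
```

With the cutoff at 150, the two bacterial targets are retained and the fungal target is excluded, which is why the text notes that the cutoff would have to be raised for antifungal searches.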

IDENTIFICATION OF ESSENTIAL TARGETS 
EXPERIMENTALLY 

Genomic sequence information is not required for discov- 
ering essential genes, but such information does facilitate the 
process. Genes which are essential to pathogenesis and prevent 



[Figure 1 not reproduced. Recoverable panel labels: Comparative
genomics (selectivity and spectrum; features): sequence similarity,
motif content, protein interaction, operon neighbors, structural
similarity. Validation of essentiality: directed allele exchange;
random transposon-based methods (genetic footprinting, GAMBIT, STM);
conditional lethals; VITA. Validation of expression: microarray
hybridization, RT-PCR, IVET, DFI. Assays and HTS: biochemical assays,
whole-cell assays, ligand-affinity assays.]

FIG. 1. Overview of genomic tools applied to antimicrobial [remainder
of caption illegible in source]. G+ve and G-ve, gram positive and gram
negative, respectively.



VOL. 43, 1999                    MINIREVIEW                    441



TABLE 2. Sequenced microbial genomes

Genome                                Strain(s)              Size (Mb)  Institution(s)                                                Reference    Internet resource
Haemophilus influenzae Rd             KW20                   1.83       TIGR                                                          13           www.tigr.org/tdb/mdb/hidb/hidb.html
Mycoplasma genitalium                 G-37                   0.58       TIGR                                                          15           www.tigr.org/tdb/mdb/mgdb/mgdb.html
Methanococcus jannaschii              DSM 2661               1.66       TIGR                                                          8            www.tigr.org/tdb/mdb/mjdb/mjdb.html
Synechocystis sp.                     PCC 6803               3.57       Kazusa DNA Research Institute                                 27           www.kazusa.or.jp/cyano/cyano.html
Mycoplasma pneumoniae                 M129                   0.81       University of Heidelberg                                      23           www.zmbh.uni-heidelberg.de/M_pneumoniae/MP_Home.html
Saccharomyces cerevisiae              S288C                  13         European and North American Consortium                        17           speedy.mips.biochem.mpg.de/mips/yeast/yeast_genome.htmlx or genome-www.stanford.edu/Saccharomyces
Helicobacter pylori                   26695                  1.66       TIGR                                                          51           www.tigr.org/tdb/mdb/hpdb/hpdb.html
Escherichia coli                      K-12                   4.6        University of Wisconsin                                       7            www.genetics.wisc.edu/
Methanobacterium thermoautotrophicum  delta H                1.75       Genome Therapeutics and Ohio State University                 43           www.genomecorp.com/genesequences/methanobacter/abstract.html
Bacillus subtilis                     168                    4.2        International Consortium                                      31           www.pasteur.fr/Bio/SubtiList.html
Archaeoglobus fulgidus                VC-16, DSM 4304        2.18       TIGR                                                          29           www.tigr.org/tdb/mdb/afdb/afdb.html
Borrelia burgdorferi                  B31                    1.44       TIGR                                                          14           www.tigr.org/tdb/mdb/bbdb/bbdb.html
Aquifex aeolicus                      VF5                    1.55       Diversa                                                       10           www.ncbi.nlm.nih.gov/cgi-bin/Entrez/framik?db=Genome&gi=133
Pyrococcus horikoshii                 OT3                    1.80       National Institute of Technology and Evaluation               28           www.bio.nite.go.jp/ot3db_index.html
Mycobacterium tuberculosis            H37Rv                  4.40       Sanger Centre                                                 9            www.sanger.ac.uk/Projects/M_tuberculosis/
Treponema pallidum                    Nichols                1.14       TIGR and University of Texas                                  16           www.tigr.org/tdb/mdb/tpdb/tpdb.html
Chlamydia trachomatis                 Serovar D (D/UW-3/Cx)  1.05       University of California at Berkeley and Stanford University  46           chlamydia-www.berkeley.edu:4231/
Rickettsia prowazekii                 Madrid E               1.11       University of Uppsala                                         4            evolution.bmc.uu.se/~siv/gnomics/Rickettsia.html
Helicobacter pylori                   J99                    1.64       Genome Therapeutics and Astra AB                              3            www.genomecorp.com/hpylori or www.astra-boston.com/hpylori
Mycobacterium tuberculosis            CSU#93                 4.40       TIGR                                                          Unpublished  www.tigr.org/cgi-bin/BlastSearch/blast.cgi?organism=m_tuberculosis




colony formation in a conditional-lethal manner are potential
targets for new antimicrobials. This assumes that a small or- 
ganic molecule which inhibits the activity of an essential gene 
product would either kill or inhibit the growth of the bacterium 
which requires that functional protein. Such conditional lethal 
genes can be discovered through classical mutagenesis tech- 
niques. Availability of the sequence of the genome means that 
the full sequence of each mutated gene, and frequently its 
cellular role as well, can be gleaned from a short sequence read 
on a complementing plasmid insert. This additional informa- 
tion accelerates the processing of a mutational study enor- 
mously. Depending on the availability of genetic tools for the 
microbial species in question, a variety of molecular genetic 
methods can be used to discover essential genes. For example, 
in E. coli, genes can be placed under control of a regulated
promoter by use of an appropriately constructed transposon
system (11), or genes can be mutated to a conditional-lethal form.
In principle, such conditional mutants can be used in whole- 
cell screens under moderately suppressing conditions in which 
the cells may be hypersensitive to drug-like compounds which
act against that gene product (see below).

It seems reasonable to assume that most genes which are
essential to the cell for growth or viability on laboratory media
will also be required for growth or viability in an infected host.
Experimentally, media can be varied in order to identify genes
which are essential under the widest range of growth conditions
and particularly in rich media which may simulate conditions
in necrotic tissue of an animal host. Cells carrying
auxotrophic mutations may find sufficient nutritional supplement
in the host tissues to permit growth or at least survival.
Such genes might be poor targets for new antimicrobials unless
experiments establish that the particular nutrient is in short 
supply in the host or that cells are incapable of transporting the 
nutrient efficiently. In order to establish that a gene target is 
essential in an infection, a transposon-based gene tagging 



method called "signature-tagged mutagenesis" (STM) has 
been used to identify genes which are essential in an animal 
model (22, 35). However, since cells carrying the disrupted 
tagged genes must be grown in the laboratory prior to intro- 
duction into the animal, the method may be biased against 
genes which are essential for growth both on laboratory media
and in an animal model. Indeed, many of the genes identified
by STM appear to encode virulence factors which affect the
ability of the pathogen to colonize or damage host tissue rather
than the viability of the pathogen. New drugs which intervene
in these processes could prove highly selective, and resistance
to such drugs might be rare since loss or mutation of the
virulence factor would also likely reduce virulence. However,
other resistance mechanisms, such as drug modification and
efflux pumps, could be problematic. In addition, the absence of
a convenient in vitro assay for such drugs would hamper the
development, testing, and approval processes. It remains unclear
how many important antimicrobial targets would be
missed by using as targets for drug discovery only those genes
which are essential for growth or viability on laboratory culture
media.

A related, important feature of a suitable antimicrobial gene 
target is its expression pattern in the infection. The absolute 
level of expression may be less important than information
about whether it is expressed at all. A highly expressed, abundant
gene product should be no more difficult to inhibit than a
low-abundance gene product since an inhibitor with suitably
high affinity will be effective in either case unless it is poorly
taken up by pathogens. However, if a gene is not expressed at
all in an established infection of an animal host, then it will be
of no interest as a potential target. A gene already established
as being essential for growth or viability in the laboratory by
genetic methods obviously must be expressed under these conditions
because its failure to be expressed as an active product
causes the pathogen to die. Knowledge that such an essential



442 MINIREVIEW 



Antimicrob. Agents Chemother. 



gene is also expressed in an animal model would suggest that 
it is essential in an infection as well. Two types of methods offer 
information about gene expression. First, for genes whose se- 
quence is known, reverse transcriptase PCR (RT-PCR) may be 
used to detect transcripts in cells grown on agar media or in 
animal infection models (47). Alternatively, for organisms 
which have been sequenced in their entirety, a whole-genome 
view of gene expression may be obtained by gridding clones, 
PCR products, or synthetic oligonucleotides representing every
gene onto a solid support. Total RNA may be isolated from
cells grown under conditions of interest, labeled, and hybridized
to the array (12). While thorough, this type of method
suffers from some problems: (i) appropriate controls must be
run to eliminate the possibility of bacterial DNA contamination
in the RNA preparation, (ii) probes are difficult to prepare
because bacterial mRNA is notoriously unstable, and (iii) the
whole-genomic scale of the experiments makes the arrayed
membranes difficult and expensive to prepare and read. A
genetic promoter trap method termed "in vivo expression technology"
or IVET may be more feasible for most laboratories
(21, 33). In this approach, which has been developed for use in
Salmonella typhimurium grown intraperitoneally in BALB/c
mice or in cultured macrophages, random DNA fragments are
cloned upstream from a gene whose expression is required for
growth in an animal host. Cells which multiply in vivo are
recovered and cloned. The sequences of fragments serving as 
functional promoters in vivo are then determined. A second, 
related promoter trap method termed "differential fluores- 
cence induction" (DFI) has been described recently (53). The 
distinguishing features of this approach are that (i) the gene 
used for selection encodes a modified green fluorescent pro- 
tein and (ii) the selection is accomplished with a fluorescence- 
activated cell sorter. If such methods can be extended to other 
bacterial species and animal hosts, they will be extremely use- 
ful for assessing random genomic fragments or specific genes 
of interest for expression in vivo. 

IDENTIFICATION OF ESSENTIAL TARGETS 
USING DATABASES 

Potential gene targets selected from databases can be vali- 
dated by examining the effect of a gene knockout on cell growth 
or viability. Recombination is almost exclusively between ho- 
mologous regions in bacterial genomes, and many common 
pathogens as well as model bacteria are transformable. Ex- 
change between the chromosomal wild-type allele and a ver- 
sion engineered to carry a deletion and/or an insertion of a 
drug resistance cassette is generally efficient enough to be 
practical in the laboratory. Interpreting the results of such an 
experiment, however, may be difficult for two reasons. First,
the frequent occurrence of polycistronic messages in bacteria
means that disruption of a gene may have a deleterious effect
on expression of a distal neighboring gene, a so-called "polar"
effect. In that case, the inviability caused by a gene knockout
could be due to loss of expression of a gene other than the one
disrupted. Precautions can be taken to reduce these effects by,
for example, including a moderate-strength outward reading
promoter in the disrupted version of the allele so as to permit
expression of the downstream gene(s). Second, the method
works better as an exclusionary tool than as an inclusionary
one. While success in generating a cell carrying a disrupted
allele indicates that the gene is not essential for growth or
viability of the cell, failure to generate such an altered cell
could be due to any one of multiple causes including polar 
effects or inefficient recombination in a particular genetic 
interval. 



TABLE 3. Additional Internet resources

Database or organization      Internet address

NCBI                          http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html
DDBJ                          http://www.ddbj.nig.ac.jp/htmls_text/Welcome-e.html
EBI/EMBL                      http://www.ebi.ac.uk/ebi_home.html
GSDB                          http://www.ncgr.org/gsdb/[illegible]
SwissProt (Geneva)            http://expasy.hcuge.ch/www/expasy-top.html
Candida                       http://alces.med.umn.edu/Candida.html
MIPS                          http://www.mips.biochem.mpg.de/
RDP                           http://rdp.life.uiuc.edu/
SGD                           http://genome-www.stanford.edu/

Metabolic
KEGG                          http://www.genome.ad.jp/kegg/
EcoCyc                        http://ecocyc.PangeaSystems.com/ecocyc/ecocyc.html
WIT                           [illegible]

Sequencing groups
Berkeley                      http://chlamydia-www.berkeley.edu:4231/
Genome Therapeutics           http://www.genomecorp.com/home.htm[illegible]
Sanger                        http://www.sanger.ac.uk/Projects/
Stanford                      http://sequence-www.stanford.edu/group/malaria/index.html
TIGR                          http://www.tigr.org/tdb/mdb/mdb.html
University of Oklahoma        http://dna1.chem.uoknor.edu/index.html
University of Queensland      http://www.cmcb.uq.edu.au/aeruginosa/
University of Washington      http://chimera.biotech.washington.edu/uwgc/
Washington University         http://genome.wustl.edu/gsc/bacterial/salmonella.html

Tools and resources
Biomolecular Research Tools   http://www.public.iastate.edu/~pedro/rt_[illegible].html
COG                           http://www.ncbi.nlm.nih.gov/COG/
NCGR                          http://www.ncgr.org/microbe/index_[illegible].html
MAGPIE                        http://www.mcs.anl.gov/home/gaasterl/genomes.html
Genobase                      http://specter.dcrt.nih.gov:8004/
Micro Underground             http://www.lsumc.edu/campus/micr/mirror/public_html/index.html
[illegible]                   http://www.wdcm.riken.go.jp/
WHO                           http://www.who.ch/Welcome.html
[illegible]                   http://www.qmw.ac.uk/~rhbm001/[illegible]/chapter.html
CDC                           http://www.cdc.gov/
University of Kansas          http://www.kumc.edu/research/fgsc/main.html
University of Georgia         http://fungus.genetics.uga.edu:5080/
Tripos                        http://www.tripos.com/sites.html
Motif                         http://dna.stanford.edu/identify/
Pedant                        http://pedant.mips.biochem.mpg.de/
GenTHREADER                   http://globin.bio.warwick.ac.uk/genome/genomic.html


One solution to this problem is to carry out allele exchange
as a two-step process (20, 32). In E. coli, for example, the disrupted
allele together with the vector carrying it can be
integrated into the genome by means of a single crossover,
a so-called "Campbell insertion." Recombination between homologous
regions on the two copies of the allele now on the
chromosome will eliminate the vector sequences and either
copy of the allele. Which copy is eliminated depends upon
which regions of homology were involved in the recombination.
Failure to find cells retaining only the disrupted allele
strongly suggests that such progeny are inviable. Success in
finding cells retaining only the wild-type allele confirms that






recombination is efficient in this genetic interval. However, in 
many naturally competent bacterial species, such as H. influenzae
and S. pneumoniae, double-crossover events are extremely
efficient, and allele replacement occurs with little or no
opportunity to isolate a single crossover intermediate (1). 
While this complicates evaluation of essential genes in these 
organisms, it provides a convenient method for disrupting 
genes under conditions in which they are not essential so that 
the resulting strains may be examined under a variety of other 
conditions (e.g., see below).
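
The interpretation of the two-step allele exchange described above amounts to a small decision rule, sketched here in Python. The sketch is illustrative only: the function name, the colony counts, and the outcome labels are hypothetical and are not taken from the cited studies.

```python
def interpret_two_step_exchange(n_disrupted_only, n_wild_type_only):
    """Interpret colonies recovered after resolution of a single-crossover integrant.

    n_disrupted_only: colonies retaining only the disrupted allele
    n_wild_type_only: colonies retaining only the wild-type allele
    """
    if n_disrupted_only > 0:
        # Viable cells carry only the disrupted allele.
        return "gene not essential under these conditions"
    if n_wild_type_only > 0:
        # Recombination works in this interval, yet no disrupted-only cells survive.
        return "gene likely essential"
    # Neither class was recovered, so recombination efficiency is unconfirmed.
    return "uninformative"


# Hypothetical screen: 48 wild-type resolvants, no disrupted-only resolvants.
print(interpret_two_step_exchange(0, 48))
```

The wild-type resolvants act as the internal control: they show that the second crossover occurs in this interval, so the absence of disrupted-only cells is not simply a recombination failure.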

A new approach promises to accelerate the process of eval- 
uating the essentiality of genes. Smith et al. (44, 45) have 
described a method for the yeast S. cerevisiae called "genetic 
footprinting" which makes use of a quasi-random transposable 
Ty element to generate a rich array of gene knockouts in a 
population of cells. Further transposition is shut off, and the
population is then grown under a variety of conditions. DNA is
prepared from cells in the various growth populations, and the
DNA is queried by PCR amplification to determine if it will
yield PCR products between a gene-specific primer and a
transposon-specific primer. Failure to find such PCR products
suggests that cells carrying transposons in that gene were inviable
under the growth conditions employed. Fluorescent PCR
products are viewed on standard sequencing gels by using
automated fluorescence sequencing machines and a commercially
available software package. An important control in this
method is the existence of a gene-to-transposon PCR product
in the so-called t0 cell population prior to the shutdown of
transposition. This assures the experimenter that this region is 
not simply a "cold" spot for transposition. The efficiency of this 
method derives from the use of random transposons to build 
all necessary gene knockouts rapidly, followed by automated 
PCR and analysis methods to interpret the results for any given 
gene of interest.
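
The logic of reading a genetic footprint, including the t0 control, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the boolean inputs, and the result labels are hypothetical, and a real analysis would work from gel band patterns rather than a single yes/no call per gene.

```python
def classify_footprint(product_at_t0, product_after_growth):
    """Classify one gene from gene-to-transposon PCR results.

    product_at_t0: PCR product seen in the t0 population (before selection)
    product_after_growth: PCR product seen after growth under the test condition
    """
    if not product_at_t0:
        # Without insertions at t0 the region may be a transposition cold spot.
        return "uninformative: possible cold spot for transposition"
    if product_after_growth:
        return "insertions tolerated under this condition"
    return "insertions lost: gene likely required under this condition"


print(classify_footprint(True, False))
```

Checking the t0 result first mirrors the control described in the text: a missing product after growth is only meaningful if insertions in that gene existed before selection.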

Recently, a modified version of this method, called "geno- 
mic analysis and mapping by in vitro transposition" (GAM- 
BIT), has been applied successfully to two bacterial species (1). 
In this variation of genetic footprinting, the transposition mu- 
tagenesis was done on PCR-amplified genomic segments from 
H. influenzae or S. pneumoniae in vitro, and the mutations were
introduced into these naturally competent host bacteria by
transformation. While the method suffers from the absence of
a true t0, the focus on 10-kb DNA segments permits near-saturation
mutagenesis with the mariner family transposon Himar1,
which shows little or no insertion site specificity. These
authors identified four essential conserved genes of unknown
function from a total of 13 analyzed.

Currently, the main limitation to this method is a require- 
ment for an efficiently transformable host bacterium so that 
mutations generated in vitro can be evaluated readily in vivo.
Other limitations which apply to all genetic footprinting meth- 
ods include the following: (i) essentiality of the function of a 
gene that is duplicated or has a functional paralog cannot be 
analyzed, since footprinting assesses the fitness of a single 
mutagenized gene; (ii) polarity effects, although not a problem
for S. cerevisiae, may lead to misinterpretation of data obtained 
from bacteria; (iii) the correlation of footprinting data with 
gene knockout data has not been confirmed in any organism; 
and (iv) footprinting data are technically difficult to interpret 
for a variety of reasons, including the facts that some essential
genes will tolerate insertions in the C-terminal coding
region (e.g., secA [1]) and cells carrying insertions in some 
genes display an intermediate slow-growth phenotype (e.g., 
ade2 [44]). 



TOOLS FOR PREDICTING THE FUNCTION 
OF GENE PRODUCTS 

Clearly, not all of the predicted functional assignments 
based on sequence similarities are reliable. In some cases, for 
example, the function of the closest-related protein has itself 
been predicted based on its sequence similarity to a gene 
product of known function. In other cases, the chain of relat- 
edness to a protein of confirmed function may be even longer. 
About half of the genes in bacterial genomes either lack significant
enough sequence similarity to permit functional assignment
or have likely homologs whose function is unknown.
In neither of these cases can a function be predicted for the 
gene product. Nevertheless, the results of sequence similarity 
searches are a useful starting point for further investigation.
More sensitive sequence comparison searches may provide a 
putative function or functional feature such as the presence of 
a short protein sequence motif. For example, a search against 
a database of clusters of orthologous groups of genes (COGs
[Table 3]) yielded over 100 additional functional predictions
for genes in the H. pylori genome (50).

Tools other than sequence similarity have also been useful in 
a few cases for predicting function of a gene product. For
example, a gene product with no significant sequence relationship
to a protein of known function but which is likely to be
cotranscribed as part of a polycistronic message with other
genes of known function may play a role in the same pathway
with the known gene products. In the E. coli genome, the
hypothetical gene yjaF appears to be cotranscribed with the
porphyrin biosynthetic gene hemE, and the hypothetical gene
yadM appears to be in an operon with the outer-membrane 
usher protein HtrE, which is involved in transport and binding. 
It is reasonable to speculate that these genes of unknown 
function play roles in the same biochemical pathways as their 
neighboring "known" genes. Of course, experimental evidence
would be required to confirm these hypotheses. Methods also 
exist for identifying likely structural similarity even in the ab- 
sence of strong primary sequence similarity. As the databases 
of known structures grow, this will become a powerful ap- 
proach for assigning likely functions to gene products. For 
example, the "GenTHREADER" web site (Table 3) presents 
analysis results from a fast fold recognition program on the 
predicted open reading frames from three bacterial genomes. 

Laboratory methods can also be invoked to solve questions 
of unknown gene identities. An unknown gene may be used as 
the bait in a yeast two-hybrid interaction trap to identify genes 
whose protein products interact with the unknown protein.
The identity of an interacting partner will frequently implicate
the unknown in a particular cellular pathway (19). Finally, an
unknown gene may be expressed as a tagged fusion, the protein
purified by affinity column, and the product tested for categories
of activities such as proteolysis, DNA cleavage or binding,
ATP or GTP hydrolysis, and binding, to name a few. The probability
of successfully identifying an activity of an unknown by
the latter method is low, but this method may be warranted if
sequence comparisons suggest the presence of a motif associated
with an assayable function. An attractive alternative is to
focus on assays which do not require knowledge of the cellular 
function of a gene product (see below). 

THE FUTURE: DEALING WITH GENE TARGETS 
HAVING NO PREDICTABLE FUNCTIONAL FEATURES 

The array of tools described so far, including comparative 
genomic methods for identifying potentially useful gene targets 
and allele exchange methods for validating the essentiality of 






those genes, provides both gene targets whose cellular function 
can be predicted and gene targets for which little or no func- 
tional information is available. Targets in the first class may 
be used immediately to build biochemical assays and high-throughput
screens to detect small organic molecules which
inhibit the biochemical activity. Typically, the gene sequence is
amplified by PCR from genomic DNA of a given bacterium,
inserted into an expression vector, and expressed in E. coli,
sometimes with affinity tags to facilitate purification of the
resulting protein product.

It is far less obvious how to proceed with gene targets lacking 
any functional information. This problem has attracted consid- 
erable attention in recent years because of the growing number 
of such targets known to be shared across many bacterial 
species (24), some of which are known to be essential in at least 
one species. As a general guide, about 40% of bacterial genes 
cannot be assigned a putative function at this time. If 10 to
15% of these genes are essential, then 4 to 6% of the genes in
a typical bacterial genome (about 100 genes) represent potential
antimicrobial targets which have never been used in
screens. Three basic types of approaches seem feasible and
have shared some initial success. First, cells expressing higher-
or lower-than-normal levels of particular genes have in some
cases been shown to be more resistant or more sensitive, respectively,
than their wild-type parents to chemical compounds
known to inhibit those gene products. For example, overexpression
of the yeast ALG7 gene results in cells more resistant
than wild-type cells to tunicamycin (38), while reduced activity
of the same gene product results in cells more sensitive to the 
drug (30). Similarly, increased expression of the ERG11 gene 
in Candida glabrata results in higher levels of resistance to the 
azole family of drugs which target that enzyme (54). A gene of 
unknown function could be overexpressed in a host strain, and 
the resulting assay strain could be tested for increased resistance
to a library of compounds. It is clear, however, that many
gene targets when overexpressed do not lead to resistance to 
chemical compounds that are known to bind to the protein 
product (e.g., gyrA [52]). Furthermore, overexpression of pro- 
teins often leads to lethality or growth defects (e.g., kasA [34]). 
Alternatively, a gene could be underexpressed or crippled by a 
mutation so that cells might show increased sensitivity to a 
compound which inhibits the protein product. Scientists at
Microcide Pharmaceuticals, Inc., have applied this approach 
on a large scale using temperature-sensitive mutants grown at 
intermediate temperatures in order to reduce the level of activity
of the target gene product (39a). Of course, it is not clear
what fraction of unknown gene products would provide the cell 
with increased drug resistance or sensitivity when over- or 
underexpressed in these ways. 
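
The estimate given earlier in this paragraph (about 40% of genes without an assigned function, 10 to 15% of those essential) can be checked with a short calculation. The sketch below is illustrative: the function name is hypothetical, and the genome size of 2,000 genes is an assumption chosen only because it is consistent with the "about 100 genes" quoted in the text.

```python
def essential_unknown_range(total_genes, pct_unknown=40, pct_essential=(10, 15)):
    """Return ((low %, high %), (low count, high count)) of essential genes
    of unknown function, as a share of all genes in the genome."""
    low_pct = pct_unknown * pct_essential[0] / 100
    high_pct = pct_unknown * pct_essential[1] / 100
    low_n = round(total_genes * low_pct / 100)
    high_n = round(total_genes * high_pct / 100)
    return (low_pct, high_pct), (low_n, high_n)


# Assumed genome size of 2,000 genes (not stated in the text).
pcts, counts = essential_unknown_range(2000)
print(pcts, counts)
```

Under that assumption the range is 4 to 6% of all genes, or 80 to 120 genes, which brackets the figure of about 100 given in the text.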

The second approach to this problem of assaying gene products
of unknown function is probably more generally applicable.
Libraries of small molecules are screened for strong binding
affinity to proteins of unknown function. This has been
achieved with peptides in phage display libraries because bind- 
ing can be readily detected by elution of bound phage from the 
protein tethered on a solid support. Proteins of unknown func- 
tion can be produced easily as affinity fusion products for 
attachment to solid supports, and a variety of peptide phage 
display libraries are commercially available. Conformationally
constrained disulfide-bonded peptides with affinities in the 100
pM to 100 nM range can be obtained by this approach (55). Of
course, not all peptides detected by this approach will bind to
sites which inhibit activity, but an elegant new method, called
"validation in vivo of targets for anti-infectives" (VITA), has 
been devised to identify those peptides which inhibit essential 
cellular functions (49). Potential inhibitory peptides were ex- 



pressed in a regulated manner within bacterial host cells which 
were grown either on agar medium or in an animal model of 
infection. Inhibition of cell growth or viability upon induction 
of peptide expression validated the peptide-protein interaction 
as useful for further drug development. While peptides are not
ideal drug candidates, a wider array of techniques are appli- 
cable after a moderate binder has been obtained. The peptide 
may be used as a surrogate ligand in a competition assay to 
identify a small organic compound with higher affinity. Scintil- 
lation proximity assays (26) or fluorescence polarization assays 
(41) may be used in a high-throughput mode to identify com- 
pounds in chemical libraries which compete for binding with a 
labeled peptide. Alternatively, ligand binding assays may be 
configured to work directly on libraries of unlabeled chemical
compounds. Shuker et al. (42) have described a nuclear magnetic
resonance-based method capable of a throughput of
1,000 compounds per day. Mass spectrometric methods are
also of interest as potentially rapid ways to detect bound ligands
from chemical libraries. One concern about these approaches
is that proteins may have multiple accessible binding
sites, many of which have nothing to do with catalytic activity.
It is not clear at this early stage how significant an issue multiple
binding sites will be. However, it is worth noting that
Shuker et al. (42) took advantage of a second binding site to
increase the affinity of an inhibitor for the protein. Ultimately,
of course, affinity ligands must be shown to inhibit cell growth, 
that is, to have antimicrobial activity. Some chemical engineer- 
ing of the compound may be required to increase microbial 
uptake. 

A third approach for assaying gene products of unknown 
function relies on the complex gene expression regulatory net- 
work found in many bacteria. Expression levels of genes in 
metabolic pathways are often regulated in response to the 
amounts of intermediates in the cell. For example, disruption
of the general secretory pathway in E. coli by mutation results
in dramatic up-regulation of secA gene expression (37). Alksne
et al. (2) took advantage of this fact to build a strain of E. coli
carrying a secA-lacZ fusion as a detectable reporter. Several 
synthetic compounds and natural products were identified by 
their ability to induce expression of the reporter. Many of these 
exhibited antimicrobial activity and reduced the secretion of 
Staphylococcus aureus toxin 1. Similarly, Mdluli et al. (34) have 
reported that sublethal concentrations of isoniazid lead to up-regulation
of the kasA and acpM genes. This group has initiated
a whole-cell, high-throughput screen of chemical compounds
which induce expression of a luciferase reporter fused
to a gene in this regulated pathway. Screens of this type, which 
take advantage of the bacterial gene regulatory network, are 
inherently less specific than the two other types described here. 
In addition, they suffer from the basic limitation of all whole-cell
screens: compounds must be capable of entering the cell in
order to be detected. However, these types of screens offer the 
potential advantage of identifying compounds which act at any 
of several points in a pathway. 

CONCLUSIONS 

The availability of genomic sequence information for all or 
nearly all of several different bacterial species provides important
new advantages for target discovery. First, it permits use of
a comparative genomic analysis to identify potential new targets
shared across several bacterial species or particular to a
single species. In this manner, it is possible to generate lists of
genes which represent potential targets for broad-spectrum or
highly focused narrow-spectrum antibiotics. Sequence comparisons
can also provide some assurance against mammalian






toxicity if proteins of similar sequence do not exist in mamma- 
lian sequence databases. Second, sequence similarity provides 
some insights into putative functions for most gene products.
Finally, availability of the entire sequence of the gene target of 
interest permits rapid construction of gene knockouts to vali- 
date the utility of the target and facile construction of expres- 
sion plasmids for production of protein and development of 
assays. The fact that bacterial and fungal genes can be assessed 
rapidly for their relevance as potential antibiotic targets by 
determining the effect of knocking out the gene and the fact
that their genomes are small enough to be sequenced in their 
entirety are compelling reasons that the field of genomics will
likely find its first real utility in the development of new anti- 
microbials. 



ACKNOWLEDGMENTS 

We thank our colleagues at Genome Therapeutics Corporation and
the Schering-Plough Research Institute for helpful discussions about
genomic approaches to drug discovery. In particular, Skip Shimer,
Brad Guild, and Lucy Ling were instrumental in the analysis of the
approaches summarized here. We thank Douglas Smith of Genome
Therapeutics Corporation for the compilation of Internet resources 
presented in Table 3. 







A Genomic Perspective on 
Protein Families 

Roman L. Tatusov, Eugene V. Koonin,* David J. Lipman

In order to extract the maximum amount of information from the rapidly accumulating
genome sequences, all conserved genes need to be classified according to their
homologous relationships. Comparison of proteins encoded in seven complete genomes
from five major phylogenetic lineages and elucidation of consistent patterns of sequence
similarities allowed the delineation of 720 clusters of orthologous groups (COGs). Each
COG consists of individual orthologous proteins or orthologous sets of paralogs from at
least three lineages. Orthologs typically have the same function, allowing transfer of
functional information from one member to an entire COG. This relation automatically
yields a number of functional predictions for poorly characterized genomes. The COGs
comprise a framework for functional and evolutionary genome analysis.



The release in 1995 of the complete genome sequence of the bacterium
Haemophilus influenzae (1), followed within the next 1.5 years by four
more bacterial genomes (2), one archaeal genome (3), and one genome of
a unicellular eukaryote (4), marked the advent of a new age in
biology. The hallmark of this era is that comparisons between
complete genomes are becoming an indispensable component of our
understanding of a variety of biological phenomena. The number of
sequenced genomes is expected to grow exponentially for at least the
next few years, and conceivably, their impact on biology will further
increase (5).

Knowing the inventory of conserved genes responsible for housekeeping
functions and understanding the differences in the genetic basis of
these functions in different phylogenetic lineages is central to
understanding life itself, at least at the level of a single cell.
Complete sequences are indispensable for achieving this goal because
they hold the only type of information that can be used to delineate
the complete network of relationships between genes from different
genomes. Furthermore, only with complete genome sequences is it
possible to ascertain that a particular protein implicated in an
essential function is not encoded in a given genome. Accordingly, an
alternative protein for the respective function should be sought
among the functionally unassigned gene products (6). With multiple
genome sequences, it is possible to delineate protein families that
are highly conserved in one domain of life but are missing in the
others. Such information may be critically important: For example,
the families that are conserved among bacteria but are missing in
eukaryotes comprise the pool of potential targets for broad-spectrum
antibiotics.

The authors are with the National Center for Biotechnology
Information, National Library of Medicine, National Institutes of
Health, Bethesda, MD 20894, USA.

*To whom requests for reprints should be addressed.
E-mail: koonin@ncbi.nlm.nih.gov

The knowledge of all of the gene sequences from multiple complete
genomes redefines the problem of gene classification. It becomes
feasible to replace the more or less arbitrary clustering of genes by
similarity with a complete, consistent system in which the groups are
likely to have evolved from a single ancestral gene. Such a natural
classification of genes will provide a framework for evolutionary
studies and for rapid, largely automatic functional annotation of
newly sequenced genomes. This framework will evolve and improve with
increasing coverage of the diversity of life forms with complete
genome sequences. It is critical to have this system in place while
the number of completed genomes is still small and each family can be
explored individually. Here we describe a prototype of a natural
system of gene families from complete genomes.

Orthologs and Paralogs: Deriving 
Clusters of Orthologous Groups 

The relationships between genes from different genomes are naturally
represented as a system of homologous families that include both
orthologs and paralogs. Orthologs are genes in different species that
evolved from a common ancestral gene by speciation; by contrast,
paralogs are genes related by duplication within a genome (7).
Normally, orthologs retain the same function in the course of
evolution, whereas paralogs evolve new functions, even if related to
the original one. Thus, identification of orthologs is critical for
reliable prediction of gene functions in newly sequenced genomes. It
is equally important for phylogenetic analysis because interpretable
phylogenetic trees generally can be constructed only within sets of
orthologs (8). A complete list of orthologs also is a prerequisite
for any meaningful comparison of genome organization (9).

A naive operational definition would simply maintain that for a given
gene from one genome, the gene from another genome with the highest
sequence similarity is the ortholog. Given the complete genome
sequences, this straightforward approach often gives credible
results, especially when the compared species are not too distant
phylogenetically (9). At larger phylogenetic distances, however, the
situation becomes more complicated. If gene duplications occurred in
each of the given two clades subsequent to their divergence, only a
many-to-many relationship will adequately describe orthologs, and
accordingly, detection of the highest similarity will not result in
the identification of the complete set of orthologs. In addition,
when the best hit is not highly significant statistically, which is
common in the case of phylogenetically distant relationships (10), it
simply may be spurious. On the other hand, attempts to apply a
restrictive similarity cutoff are likely to result in a number of
orthologs being missed.
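The naive definition and its two failure modes can be illustrated with a minimal Python sketch. Everything here is hypothetical: the gene names, the scores, and the function `naive_ortholog` are invented for illustration and are not taken from the article.

```python
from typing import Dict, Optional, Tuple

# Hypothetical similarity scores: (query gene, candidate in another genome) -> score.
Scores = Dict[Tuple[str, str], float]


def naive_ortholog(query: str, scores: Scores, cutoff: float = 0.0) -> Optional[str]:
    """Return the single highest-scoring hit for `query`, or None if no
    hit reaches `cutoff`. This is the naive best-hit rule only."""
    hits = {cand: s for (q, cand), s in scores.items() if q == query and s >= cutoff}
    if not hits:
        return None
    return max(hits, key=hits.get)


# Toy data (assumed): geneA has two co-orthologs B1 and B2 produced by a
# duplication in the other lineage; geneC has only a weak hit, B3.
toy_scores: Scores = {
    ("geneA", "B1"): 310.0,
    ("geneA", "B2"): 295.0,
    ("geneC", "B3"): 42.0,
}

best_for_a = naive_ortholog("geneA", toy_scores)               # reports B1 only; B2 is lost
strict_for_c = naive_ortholog("geneC", toy_scores, cutoff=50)  # cutoff discards the weak hit
loose_for_c = naive_ortholog("geneC", toy_scores)              # without a cutoff, B3 is reported
```

In this toy case the best-hit rule reports only one of the two co-orthologs of `geneA`, and a restrictive cutoff drops the weak hit for `geneC` whether or not it is genuine.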

Given the existence of one-to-many and many-to-many orthologous
relationships, we redefined the task of identifying orthologs as the
delineation of clusters of orthologous groups (COGs). Each COG
consists of individual orthologous genes or orthologous groups of
paralogs from three or more phylogenetic lineages. In other words,
any two proteins from different lineages that belong to the same COG
are orthologs. Each COG is assumed to have evolved from an individual
ancestral gene through a series of speciation and duplication
events.

In order to delineate the COGs, all pairwise sequence comparisons
among the 17,967 proteins encoded in the seven complete genomes were
performed (11), and for each protein, the best hit (BeT) in each of
the other genomes was detected. The identification of COGs was based
on consistent patterns in the graph of BeTs. The simplest and most
important of such patterns is a triangle, which typically consists of
orthologs (Fig. 1A). Indeed, if a gene from one of the compared
genomes has BeTs in two other genomes, it is highly unlikely that the
respective genes are also BeTs for one another unless they are bona
fide orthologs (12). The consistency between BeTs resulting in
triangles does not depend on the absolute level of similarity between
the compared proteins and thus allows the detection of orthologs
among both slowly and quickly evolving genes. This approach is most
likely to be informative when the BeTs forming a triangle come from
widely different lineages. Accordingly, only five major,
phylogenetically distant clades were used as independent contributors
to COGs: Gram-negative bacteria (Escherichia coli and H. influenzae),
Gram-positive bacteria (Mycoplasma genitalium and M. pneumoniae),
Cyanobacteria (Synechocystis sp.), Archaea (Euryarchaeota)
(Methanococcus jannaschii), and Eukarya (Fungi) (Saccharomyces
cerevisiae) (13).
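The triangle pattern in the graph of BeTs can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' implementation: the function name, the data layout, and the brute-force search are assumptions, and a side of a triangle is taken to exist if a BeT is observed in at least one direction. The three protein names and their symmetrical BeTs follow the Fig. 1A example; `MJ_hyp` is a hypothetical extra protein added to show a non-member.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def find_triangles(lineage: Dict[str, str],
                   bet: Dict[str, Set[str]]) -> Set[FrozenSet[str]]:
    """Find triples of proteins from three different lineages that are
    pairwise connected by BeTs (in either direction)."""

    def linked(a: str, b: str) -> bool:
        # Assumption: an asymmetrical BeT still forms a side of a triangle.
        return b in bet.get(a, set()) or a in bet.get(b, set())

    triangles: Set[FrozenSet[str]] = set()
    # Brute-force search over all triples; adequate for a toy example only.
    for a, b, c in combinations(sorted(lineage), 3):
        if len({lineage[a], lineage[b], lineage[c]}) < 3:
            continue  # all three proteins must come from distinct lineages
        if linked(a, b) and linked(b, c) and linked(a, c):
            triangles.add(frozenset((a, b, c)))
    return triangles


# Lineage of each protein (first three as in Fig. 1A; MJ_hyp is hypothetical).
toy_lineage = {
    "KatG": "Gram-negative bacteria",
    "sll1987": "Cyanobacteria",
    "YKR066c": "Eukarya",
    "MJ_hyp": "Archaea",
}
# Directed best hits: protein -> its BeTs in other genomes.
toy_bet = {
    "KatG": {"sll1987", "YKR066c"},
    "sll1987": {"KatG", "YKR066c"},
    "YKR066c": {"KatG", "sll1987"},
    "MJ_hyp": {"KatG"},  # one-sided hit that closes no triangle
}

toy_triangles = find_triangles(toy_lineage, toy_bet)
```

Note that the sketch never looks at the similarity scores themselves, only at which protein is the best hit, which mirrors the point that triangle consistency does not depend on the absolute level of similarity.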

The procedure used to derive COGs included finding all triangles
formed by BeTs between the five major clades and merging those
triangles that had a common side until no new ones could be joined. A
triangle is an elementary, minimal COG (Fig. 1A). The groups produced
by merging adjacent triangles include orthologs from different
lineages and, in many cases, paralogs from the same lineage (Fig. 1,
B and C). Because of the existence of paralogs, the BeTs that form
the triangles are not necessarily symmetrical: For example, in the
COG shown in Fig. 1C, the same M. genitalium protein, MG249, is the
BeT for four paralogous σ subunits of E. coli RNA polymerase, but
only for one of them, RpoD, is the relationship symmetrical.

Fig. 1. Examples of COGs. Solid lines show symmetrical BeTs. Broken
lines show asymmetrical BeTs, with color corresponding to the species
for which the BeT is observed. Genes from the same species are
adjacent; otherwise the gene names are positioned arbitrarily. A
unique COG ID is indicated in the upper left corner. (A) Congruent
BeTs form a triangle, the minimal COG. Origin of the proteins: KatG,
E. coli; sll1987, Synechocystis sp.; and YKR066c, S. cerevisiae. Note
that all the BeTs are symmetrical. (B) A simple COG with two yeast
paralogs. Origin of the proteins: IleS, E. coli; HIN0378, H.
influenzae; MG345, M. genitalium; MP322, M. pneumoniae; MJ0947, M.
jannaschii; and YBL076c and YPL040c, S. cerevisiae. Note the adjacent
triangles with a common side, for example, IleS-MG345-MJ0947 and
sM362-MG345-MJ1362. YPL040c is the yeast mitochondrial isoleucyl-tRNA
synthetase; the bacterial orthologs and that from M. jannaschii are
the BeTs for this yeast protein, but the reverse is true only of the
bacterial proteins (symmetrical BeTs). Conversely, for YBL076c, which
is the yeast cytoplasmic isoleucyl-tRNA synthetase, the M. jannaschii
ortholog is a symmetrical BeT, whereas the bacterial BeTs are
asymmetrical. (C) A complex COG with multiple paralogs. Origin of the
proteins: RpoH, RpoS, RpoD, and FliA, E. coli; HIN1403 and HIN1655,
H. influenzae; MG249, M. genitalium; MP485, M. pneumoniae; sll0184,
sll0306, slr0653, sll1689, sll2012, and slr1554, Synechocystis sp.
RpoD, HIN1655, slr0653, and MG249 are major sigma factors (σ70),
whose function is universal in bacteria; note the fully symmetrical
relationships between these proteins. The other proteins are
specialized sigma factors whose radiation from the ancestral family
apparently was accompanied by modification of the function and
involved accelerated evolution; note the asymmetrical BeTs.
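The merging step, in which triangles sharing a common side are joined until no new ones can be added, can be sketched with a union-find structure. This is one possible reading of the procedure rather than the authors' code; the function name and the protein names (`A1`, `B1`, and so on) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def merge_triangles(triangles: Iterable[FrozenSet[str]]) -> List[Set[str]]:
    """Merge triangles that share a common side (two proteins) into
    clusters, repeating transitively until nothing more can be joined."""
    tris = list(triangles)
    parent = list(range(len(tris)))

    def find(i: int) -> int:
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Each side remembers the first triangle that contained it; any later
    # triangle with the same side is merged into that triangle's cluster.
    side_owner: Dict[Tuple[str, str], int] = {}
    for idx, tri in enumerate(tris):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                union(side_owner[side], idx)
            else:
                side_owner[side] = idx

    clusters: Dict[int, Set[str]] = defaultdict(set)
    for idx, tri in enumerate(tris):
        clusters[find(idx)].update(tri)
    return sorted(clusters.values(), key=lambda s: sorted(s))


# Hypothetical triangles: the first two share the side A1-B1 and merge;
# the third shares only the single protein C1 and stays separate here.
toy_triangles = [
    frozenset({"A1", "B1", "C1"}),
    frozenset({"A1", "B1", "D1"}),
    frozenset({"C1", "X", "Y"}),
]
toy_clusters = merge_triangles(toy_triangles)
```

Union-find is a design choice of this sketch: it yields the same result as repeatedly joining adjacent triangles, without rescanning the whole set after every merge.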

Most of the clusters derived by the above procedure meet the
definition of a COG, that is, all of the proteins from the different
lineages in the same cluster are likely to be orthologs. There are,
however, several reasons why, in certain cases, COGs may be lumped
together. Proteins may contain two or more distinct regions, each of
which belongs to a different conserved family; usually such proteins
are loosely referred to as multidomain (14). Each of the clusters was
inspected for the presence of multidomain proteins, individual
domains were isolated (15), and a second iteration of the sequence
comparison was performed with the resulting database of domains. Some
of the COGs may include proteins from different lineages that are
paralogs rather than orthologs, primarily because of differential
gene loss in the major phylogenetic lineages. When one gene in a pair
of paralogs is lost in one lineage but not in the others, two COGs
that should have been distinct may be artificially joined. Therefore,
the level of sequence similarity between the members of each cluster
was analyzed, and clusters that seemed to contain two or more COGs
were split.

Phylogenetic and Functional 
Patterns in COGs 

The described analysis resulted in 710 apparent COGs. This set
appears to be essentially complete as far as orthologous
relationships are concerned. Indeed, when the portion of the database
of proteins from complete genomes not included in the COGs was
clustered by sequence similarity (16), only 10 groups were
identified, which, upon careful inspection of the alignments, were
considered likely to constitute additional COGs missed originally.
These groups were incorporated, producing the final collection of 720
COGs, including 6814 proteins and distinct domains of multidomain
proteins (6646 distinct gene products, or 37% of the total number of
genes in the seven complete genomes) (17).
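The counts in this paragraph can be checked with a few lines of Python. The only assumption is that the 17,967 proteins compared in the seven genomes serve as the denominator for the quoted 37%; the variable names are illustrative.

```python
# Figures quoted in the text.
initial_cogs = 710             # apparent COGs from the triangle procedure
added_groups = 10              # additional groups found by similarity clustering
gene_products_in_cogs = 6646   # distinct gene products in the final COGs
total_proteins = 17967         # assumed denominator: proteins in the seven genomes

final_cogs = initial_cogs + added_groups
fraction_in_cogs = gene_products_in_cogs / total_proteins

print(f"final COGs: {final_cogs}")
print(f"fraction of gene products in COGs: {fraction_in_cogs:.1%}")
```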

Most of the COGs are relatively small groups of proteins. One-third
of the COGs (240 COGs with 1406 proteins) contain one representative
of each of the included species (no paralogs), and 192 more COGs
include paralogs from only one species, most frequently yeast (87
COGs). The mean number of proteins per COG increases with increasing
number of genes in a genome, from 1.2 for M. genitalium to 2.9 for
yeast. A notable aspect of many COGs is the differential behavior of
paralogs. It is typical that one of the paralogs, for example, in
yeast, shows consistently higher similarity to the orthologs in all
or most of the other species (Fig. 1, B and C). For numerous yeast
paralogs, particularly components of the translation apparatus, the
underlying cause is obvious: the gene whose product is most similar
to the bacterial orthologs is of mitochondrial origin (Fig. 1B). A
more common explanation for the asymmetry of the relationships in the
COGs, however, is that the highly conserved paralog has retained the
original function, whereas the functions of the less conserved
paralogs have changed in the course of evolution. In the already
considered example (Fig. 1C), the symmetrical component of the graph
(solid lines) delineates the conserved function of the σ70 subunit of
the RNA polymerase (E. coli RpoD), which is required for the
transcription of the bulk of bacterial genes, whereas the
asymmetrical BeTs (broken lines) are observed for σ subunits (E. coli
RpoH, RpoS, and FliA) involved in the transcription of specialized
gene subsets (18). This phenomenon appears to be widespread, as we
found 549 proteins in 302 COGs whose corresponding paralogs showed
consistently lower similarity to other members of the COG. One may
think of the rapidly evolving paralogs as progenitors of new families
emerging from within the conserved ones. The COGs will be an
important resource in a systematic survey of the functional
diversification of paralogs in conserved gene families.

There are several large clusters in the current collection with
complex relationships between members. Two of these, namely the
adenosine triphosphatase (ATPase) components of ABC transporters and
histidine kinases, each include over 100 members. It is likely that
subsequent detailed analysis of these large groups (for example, by
phylogenetic tree methods) will result in their split into several
distinct COGs, especially when more genomes are available. On a more
general note, COGs do not supplant traditional methods of
phylogenetic analysis but rather provide the appropriate starting
material for these methods, in particular for a systematic analysis
of phylogenetic tree topology.

Figure 2 shows the breakdown of the COGs by broadly defined function
(19) and by species (20). For the majority of the COGs, the protein
function is either known from direct experiments, mainly in E. coli
or yeast, or can be confidently inferred on the basis of significant
sequence similarity to functionally characterized proteins from other
species. It has to be emphasized that construction of the COGs
includes automatic prediction of the function for numerous genes,
particularly from the poorly characterized genomes such as M.
jannaschii. There is, however, a substantial fraction of the COGs
(14%) for which only general functional prediction, typically of
biochemical activity, but not the actual cellular role could be made,
and for another 5%, there was no functional clue (Fig. 3). Each of
the COGs includes proteins from at least three major clades whose
divergence time is estimated to be over a billion years (21), that
is, they all are ancient, conserved families with important, if not
necessarily essential, cellular functions. Therefore, the proteins
belonging to the "mysterious" COGs are good candidates for directed
experimental studies.

The distribution of proteins from different species in the COGs shows
several trends (Fig. 2), although the bias in the current collection
of complete genomes (in particular, because three lineages are
required to form a COG, all COGs had to have a bacterial member) must
be taken into account when interpreting these comparisons. The
fraction of proteins belonging to COGs is greatest in the nearly
minimal genomes of mycoplasmas (70% for M. genitalium) and much lower
in the larger genomes of E. coli and yeast (40% and 26%,
respectively), which indeed is the tendency expected of conserved
families presumably associated with cellular housekeeping functions.
The genes of the pathogenic bacteria (H. influenzae and two
mycoplasmas) are essentially subsets of the two larger bacterial gene
complements, E. coli and Synechocystis sp. The latter two species
almost always co-occur in the COGs. The main cause of the observed
congruency is likely to be the conservation of the core of ancestral
bacterial genes in nonparasitic species from different major clades.
Accordingly, the fact that proteins from the pathogenic bacteria are
missing in many COGs most likely testifies to gene loss, which has
been extensive even in this subset of highly conserved genes. The
co-occurrence of M. jannaschii in a COG with E. coli or Synechocystis
is measurably more frequent than that with yeast (Fig. 2). Such a
distribution of the archaeal genes appears to be due primarily to the
blending of bacterial-like and eukaryotic-like genes in the archaeal
genomes (10), although the mentioned bias in the genome collection is
also a factor.

The phylogenetic distribution of the COG members is distinct for
different functional classes (Fig. 2). It is not unexpected that
translation is the only category in which ubiquitous COGs are
predominant. Another obvious trend is the absence of proteins from
pathogenic bacteria (H. influenzae and, particularly, the
mycoplasmas) in many COGs in each functional category other than
translation and transcription, but especially in the metabolic
functional classes. Conversely, the congruence between the two
nonparasitic bacteria, E. coli and Synechocystis sp., holds for all
functional classes (Fig. 2). Also apparent is the differential
appearance of archaeal proteins that tend to group with yeast
proteins in the translation and transcription classes (which, given
the bias in the genome collection, results in ubiquitous COGs) but in
all other functional classes are frequently found in COGs with
bacterial proteins only.

[Figure 2 panel labels legible in the scan: Translation, ribosomal
structure, and biogenesis; Amino acid metabolism and transport
(98/993).]

Fig. 2. A functional and phylogenetic breakdown of the COGs. E
indicates E. coli; H, H. influenzae; G, M. genitalium; P, M.
pneumoniae; C, Synechocystis sp.; M, M. jannaschii; and Y, S.
cerevisiae. Each column shows a COG; a double streak indicates that
two or more paralogs from the given species belong to the particular
COG. The number of COGs (numerator) and the number of proteins in
them (denominator) is indicated for each functional category. Capital
letters in the leftmost field encode the functional categories (used
in the COG IDs).

www.sciencemag.org • SCIENCE • VOL 278 • 24 OCTOBER 1997

The phylogenetic distribution of COG membership can be conveniently
presented in terms of "phylogenetic patterns" which show the presence
or absence of each analyzed species (Fig. 3). Of the 88 patterns that
include at least three lineages (the definition of a COG), 36 were
actually found. Missing were mostly patterns with only one of the two
species of Mycoplasma, which was predictable because the gene
complement of M. genitalium is essentially a subset of the M.
pneumoniae complement (22). The remaining eight patterns that were
never observed all include pathogenic bacteria without E. coli, which
is the largest and most diverse of the available bacterial genomes.
The two most abundant patterns could easily be predicted: all species
("ehgpcmy"), and all species except for the mycoplasmas ("eh_cmy").
What appears much less trivial is that these patterns together
encompass only one-third of all COGs. This fact emphasizes the
remarkable fluidity of genomes in evolution, revealed in spite of the
fact that the analysis concentrated on ancient conserved families.
Multiple solutions for the same important cellular function appear to
be a rule rather than an exception, at least when phylogenetically
distant species are considered (10, 23). On the other hand, the eight
most frequent patterns, which together account for 85% of the COGs,
all include both E. coli and Synechocystis, emphasizing the
congruency between these genomes.
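The figure of 88 possible patterns can be reproduced by enumeration. The sketch below uses the one-letter species codes from the Fig. 2 caption and the five lineages named in the text. As assumptions of this sketch, the function names are invented and an absent species is written as "-", which may differ from the notation used in the article's figures.

```python
from itertools import product
from typing import List

# Species codes grouped by the five major lineages used to define COGs.
LINEAGES = {
    "Gram-negative bacteria": "eh",  # E. coli, H. influenzae
    "Gram-positive bacteria": "gp",  # M. genitalium, M. pneumoniae
    "Cyanobacteria": "c",            # Synechocystis sp.
    "Archaea": "m",                  # M. jannaschii
    "Eukarya": "y",                  # S. cerevisiae
}
SPECIES = "ehgpcmy"


def lineages_present(pattern: str) -> int:
    """Count lineages with at least one species present in the pattern."""
    return sum(any(s in pattern for s in members) for members in LINEAGES.values())


def all_cog_patterns() -> List[str]:
    """Enumerate every presence/absence pattern spanning >= 3 lineages."""
    patterns = []
    for mask in product([False, True], repeat=len(SPECIES)):
        # Assumed convention: "-" marks an absent species.
        pattern = "".join(s if keep else "-" for s, keep in zip(SPECIES, mask))
        if lineages_present(pattern) >= 3:
            patterns.append(pattern)
    return patterns


cog_patterns = all_cog_patterns()
print(len(cog_patterns))
```

Of the 128 possible presence/absence combinations of seven species, 88 span at least three of the five lineages, in agreement with the number given in the text.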



The 114 ubiquitous COGs, most of them including components of the
translation and transcription machinery, form the universal core of
life. This set is more than twofold down from the bacterial "minimal
set" consisting of 256 genes (23), but significant further erosion
seems unlikely, given the broad spectrum of compared genomes.

The higher order distribution of the COGs by the three domains of
life, with only 45% of the COGs including representatives of
Bacteria, Archaea, and Eukarya, is another manifestation of the
dynamics of gene families in evolution (Fig. 3). The picture is
expected to become even more complex, and the fraction of
three-domain COGs will probably drop, once archaeal-only,
eukaryotic-only, and archaeal-and-eukaryotic COGs emerge with the
accumulation of genome sequences.

The unusual, rare patterns are of particular interest, suggesting the
possibility of unexpected findings. Each of the COGs with patterns
that occur only once in our current collection (Table 1) should
correspond to a unique function scattered over disconnected branches
of the tree of life. Why such functions are conserved and are
presumably important for survival in some but not other lineages is a
challenge to be addressed experimentally. The principal evolutionary
mechanisms that can be invoked to explain the emergence of these rare
patterns are differential gene loss and horizontal transfer of genes.
Some of the functions involved, for example, lipoate-protein ligase
and glycyl-transfer ribonucleic acid (tRNA) synthetase, appear to be
strictly essential, but in different species, they are performed by
two distinct sets of orthologs unrelated to one another (24). Other
functions, for example, thymidine phosphorylase and hexuronate
dehydrogenases, may be dispensable under most conditions, and
accordingly, differential gene loss is likely; it is remarkable,
however, that these functions are preserved in the nearly minimal
gene complements of the mycoplasmas. Two of the unique patterns,
namely "__gpc_y" and "_hgp__y," might have evolved through horizontal
transfer of typical eukaryotic genes into bacterial genomes. The
latter pattern is of particular interest as it involves the choline
kinase gene common to a number of bacterial pathogens and implicated
in pathogenicity (25). Two of the COGs with unique patterns,
"_h__c_y" and "e_gp_my," include highly conserved but uncharacterized
proteins whose functions could be predicted only by detailed analysis
of conserved protein motifs (Table 1). These examples demonstrate the
potential for protein function prediction inherent in the
construction of the COGs themselves.

Fig. 3. Phylogenetic patterns in COGs. Letter codes as in Fig. 2
(ignore case); an underline indicates absence of the respective
species. Shading indicates the eight most frequent patterns.
Column sums, COGs (%): Bacteria + Eukarya + Archaea, 323 (45);
Bacteria + Eukarya, 215 (30); Bacteria + Archaea, 122 (17);
Bacteria only, 60 (8).

The sampling of genomes we compared is small and biased, and when a
more complete set is available, the distribution of COGs by
phylogenetic patterns is likely to change significantly; for example,
many patterns that are currently rare may become common when larger
genomes from the Gram-positive bacterial lineage (such as Bacillus
subtilis) become available. Nevertheless, we believe that the
language of phylogenetic patterns will become even more useful for
the description of relationships between multiple genomes.

Connecting and 
Expanding the COGs 

Ancient families of paralogs that span a broad range of taxa are well
known (26). Accordingly, a number of COGs are related to each other
and can be connected into superfamilies. In order to elucidate the
superfamily structure of the COG collection, we used the recently
developed PSI-BLAST (position-specific iterative BLAST) program,
which combines BLAST search with profile analysis (27). Two COGs were
considered connected if at least two of the proteins from the first
COG hit members of the second COG in the PSI-BLAST search, and vice
versa. Clustering by this criterion produced 58 superfamilies
including 280 COGs.
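
The connection rule above can be sketched in a few lines. This is a
minimal illustration, not the authors' implementation: the function
names, the toy identifiers, and the choice of single-linkage grouping
(connected components via union-find) are all assumptions made here.
The text states only the pairwise criterion and that COGs were
clustered by it.

```python
from collections import defaultdict


def connected(hits, members, cog_a, cog_b, min_proteins=2):
    """True if at least min_proteins proteins of each COG hit the other COG.

    hits maps a query protein to the set of proteins it hit in a
    PSI-BLAST-style search (hypothetical input format).
    """
    def one_way(src, dst):
        n = sum(1 for p in members[src] if hits.get(p, set()) & members[dst])
        return n >= min_proteins

    return one_way(cog_a, cog_b) and one_way(cog_b, cog_a)


def superfamilies(members, hits):
    """Group COGs into connected components of the 'connected' relation."""
    parent = {c: c for c in members}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    cogs = sorted(members)
    for i, a in enumerate(cogs):
        for b in cogs[i + 1:]:
            if connected(hits, members, a, b):
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for c in cogs:
        groups[find(c)].add(c)
    # Report only groups that actually join two or more COGs.
    return [g for g in groups.values() if len(g) > 1]


# Toy data with made-up identifiers.
members = {
    "COG_A": {"a1", "a2", "a3"},
    "COG_B": {"b1", "b2"},
    "COG_C": {"c1", "c2"},
}
hits = {"a1": {"b1"}, "a2": {"b2"}, "b1": {"a1"}, "b2": {"a3"}, "c1": {"a1"}}
print(superfamilies(members, hits))
```

In the toy data COG_A and COG_B satisfy the criterion in both
directions, whereas COG_C has only a single one-way hit and stays
outside any superfamily.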

Compared to COGs themselves, the superfamilies are a higher level of
protein classification. Typically, they include conserved motifs that
are determinants of a distinct biochemical activity, which, however,
may be required for a variety of cellular functions. For example, the
largest superfamily contains 53 COGs with 863 proteins, all of which
contain conserved motifs typical of ATPases and GTPases but are
involved in a broad range of processes from DNA replication to
metabolite transport (28).

Superfamilies and their signature motifs 






SCIENCE • VOL 278 • 24 OCTOBER 1997 • www.sciencemag.org 






will be useful in classifying proteins that have evolved to an extent
that they cannot be assigned to any COG but still retain a conserved
motif. We sought to detect such proteins with distant, subtle
similarity to COGs that might be encoded in the analyzed genomes. The
PSI-BLAST analysis (27) detected "tails" of distantly related
proteins (a total of 3686) for 321 COGs, increasing the total number
of proteins connected to COGs to 10,332 (58% of the entire protein
set from complete genomes).

Because apparent orthologs from at least three major clades were
required to form a COG, there are potential new COGs hidden among the
results of the comparison of protein sequences from complete genomes
(11). Clustering the proteins not included in COGs by sequence
similarity (24) resulted in 443 groups with members from two clades.
Predictably, the greatest number, 204, were from the cyanobacterial
and Gram-negative clades, followed by 67 groups combining yeast and
M. jannaschii. Many of these groups are likely to become COGs once
additional genomes are included in the analysis.

Prediction of Protein Functions 
with the COG System 

The COG system allows automatic functional and phylogenetic
annotation of genes and gene sets (29). As in the procedure used for
the construction of the COGs, the criterion for adding likely
orthologs from other genomes to the COGs is based on the consistency
between the observed relationships. A protein is compared to the
database of protein sequences from complete genomes (11) and is
included in a COG if at least two BeTs fall into it. Given that the
COGs were constructed from proteins encoded in complete genomes, it
is not a requirement that newly included proteins also originate from
a complete genome. Indeed, while the unsequenced portion of a genome
may encode proteins with the highest similarity to those included in
COGs, the BeTs will not change for the products of already sequenced
genes.
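
The "two BeT" inclusion rule can be illustrated as follows. This is a
hedged sketch rather than the published procedure: it assumes that a
BeT is the best hit of the query in one particular genome, that each
database protein belongs to at most one COG, and that ties between
COGs are broken arbitrarily. The function name and the identifiers
are hypothetical.

```python
from collections import Counter


def assign_cog(bets, protein_to_cog, min_bets=2):
    """Return the COG receiving at least min_bets of the query's BeTs.

    bets maps a genome name to the query's best hit in that genome;
    protein_to_cog maps database proteins to their COG. Returns None
    when no COG collects enough BeTs.
    """
    counts = Counter(
        protein_to_cog[hit] for hit in bets.values() if hit in protein_to_cog
    )
    if not counts:
        return None
    cog, n = counts.most_common(1)[0]
    return cog if n >= min_bets else None


# Toy data with made-up identifiers.
protein_to_cog = {"ecoA": "COG_X", "sceA": "COG_X", "mjaB": "COG_Y"}

query_1 = {"E. coli": "ecoA", "S. cerevisiae": "sceA", "M. jannaschii": "mjaB"}
query_2 = {"E. coli": "ecoA", "M. jannaschii": "mjaB"}

print(assign_cog(query_1, protein_to_cog))  # COG_X
print(assign_cog(query_2, protein_to_cog))  # None
```

The first query has two BeTs in the same COG and is assigned to it;
the second has its BeTs split between two COGs and is left
unassigned.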

As a demonstration of the principle coupled with additional
characterization of the COGs themselves, the sequences of proteins
with known three-dimensional structures from the PDB database (30)
were compared to the protein sequences encoded in complete genomes.
The "two BeT" procedure resulted in proteins with known
three-dimensional structure being included in 183 COGs, of which one
was shown to be a false positive by subsequent alignment analysis.
Thus, structural information could be inferred for at least 25% of
the COGs. In most cases, the structurally characterized protein (from
E. coli or yeast) actually belongs to a COG or is a closely related
homolog of the proteins forming a COG.

Some of the predictions, however, provide significant functional and
structural inferences. Of particular interest are (i) the possibility
of modeling the nuclease domain of polyadenylate cleavage factors
(31) with the beta-lactamase structure, (ii) the presence of an
acylphosphatase domain in hydrogenase expression factors, which form
a highly conserved COG, and in a number of uncharacterized proteins,
and (iii) the connection between a unique carbonic anhydrase and an
acetyltransferase family (Table 2).

Table 1. Unique phylogenetic patterns among COGs. The pattern
designations are as in Fig. 3; each COG ID includes a letter
indicating the functional category to which the constituent proteins
belong (Fig. 2).

Pattern and COG ID: e_gp_m_, COG0213F
Proteins: DeoA-MG051-MP090-MJ0657
Activity or function: Thymidine phosphorylase; salvage of
  deoxypyrimidines
Comment: Nonessential gene in E. coli; apparent orthologs found in
  other Gram-positive bacteria and in humans (35).

Pattern and COG ID: e__p__y, COG0246G
Proteins: MtlD, UxaB, UxuB, YdfI, YeiQ-MP190-YEL070W, YNR073C
Activity or function: Mannitol-1-phosphate and other hexuronate
  dehydrogenases; hexuronate catabolism
Comment: Nonessential genes in E. coli; accessory reactions of
  carbohydrate metabolism (36).

Pattern and COG ID: e_gp__y, COG0095H
Proteins: LplA-MG270-MP450-(sll0809)-YJL046w
Activity or function: Lipoate-protein ligase A; ligation of lipoate
  to apoproteins of pyruvate dehydrogenase and other
  lipoate-dependent enzymes
Comment: There are two unrelated classes of lipoate-protein ligases;
  E. coli and yeast encode both forms; H. influenzae and
  Synechocystis sp. encode the B form (included in a separate COG);
  sll0809 is a distant homolog of the A form (37), which was not
  automatically included in the COG but was detected with PSI-BLAST.

Pattern and COG ID: eh_pc_y, COG0604R
Proteins: AdhC + 1B E. coli proteins-MP27B-sll0990, slr1192-YBR046c
  + 19 yeast proteins
Activity or function: Alcohol dehydrogenase class III and related
  Fe-S dehydrogenases; various catabolic pathways
Comment: Highly conserved protein family distinct from other Fe-S
  oxidoreductases.

Pattern and COG ID: _h__c_y, COG0678R
Proteins: HIN1693-sll1621-YLR109W
Activity or function: Glutaredoxin-like membrane protein (prediction)
Comment: The H. influenzae protein contains an additional
  thioredoxin-like domain.

Pattern and COG ID: __gpc_y, COG0631R
Proteins: MG108-MP586-sll1771-sll1033-sll0602-YDL006w + 6 yeast
  proteins
Activity or function: Protein serine and threonine phosphatase
Comment: Serine and threonine protein phosphatases are abundant in
  eukaryotes but not in bacteria (38).

Pattern and COG ID: __gp_my, COG0423J
Proteins: MG251-MP483-MJ0228-YPR081C, YBR121C
Activity or function: Glycyl-tRNA synthetase (eukaryotic and
  Gram-positive type)
Comment: Gram-negative bacteria and Synechocystis encode a distinct
  glycyl-tRNA synthetase that appears to be unrelated to the
  eukaryotic and Gram-positive type; the closest relative of this COG
  in E. coli and H. influenzae is prolyl-tRNA synthetase (24).

Pattern and COG ID: e_gp_my, COG0622R
Proteins: D2300-MG207, MP029-MJ0623, MJ0936-YHR012W
Activity or function: Phosphoesterase (prediction)
Comment: Highly conserved protein family that shares only modified
  catalytic motifs (detected by PSI-BLAST; P ~ 0.004) with other
  phosphoesterases, including protein phosphatases.

Pattern and COG ID: eh_pcmy, COG0078E
Proteins: ArgI, ArgF, YgeW-HIN0012-MP531-sll0902-MJ0881-YJL088w
Activity or function: Ornithine carbamoyltransferase; arginine
  biosynthesis
Comment: Amino acid metabolism appears to be completely missing in
  M. genitalium, but residual reactions may occur in M. pneumoniae.

Pattern and COG ID: _hgp__y, COG0510M*
Proteins: HIN0938-MG356, MP310-YDR147W, YLR133W
Activity or function: Choline kinase (prediction) involved in
  lipopolysaccharide biosynthesis
Comment: Enzyme common to several bacterial pathogens and
  eukaryotes; contributes to pathogenicity (25).

*This COG was added to the collection by cluster analysis.

Probably the most important application of the COGs is functional
characterization of newly sequenced genomes. In the preliminary
analysis of the recently published genome of the major human
bacterial pathogen Helicobacter pylori (32), 813 proteins (51% of the
gene products) from this bacterium were included in 453 preexisting
COGs and 143 new COGs (33). In spite of the fact that many H. pylori
proteins are highly similar to homologs from E. coli and other
bacteria and have been explored in detail (32), this analysis
produced over 100 additional functional predictions (33).

Conclusions and Perspective 

The COGs bring together the fields of comparative genomics and
protein classification. Among the numerous possible approaches to
protein classification, the COGs appear to be unique as a prototype
of a natural system, which has as its basic unit a group of
descendants of a single ancestral gene. Typically, such a group is
associated with a conserved, specific function, so that the inclusion
of a protein in a COG automatically entails functional prediction.

Each COG contains conserved genes from at least three
phylogenetically distant clades and, accordingly, corresponds to an
ancient conserved region (ACR). Previous analyses have indicated that
the total number of distinct ACRs is likely to be less than 1000
(34). Thus, even with the limited number of complete genomes
currently available for analysis, the COGs have already captured a
substantial fraction of all existing highly conserved protein
domains. With more genomes included in the system, the discovery of
additional COGs should gradually level off, with the great majority
of the ACRs encoded in the added genomes fitting into already known
COGs.

With the forthcoming flood of genome sequences, a coherent framework
for understanding these genomes from both the functional and
evolutionary viewpoints is a must. We regard the current collection
of COGs as a crude first version of such a framework. Inclusion of
additional, phylogenetically diverse genomes and further development
of the procedures used to derive and analyze COGs will hopefully
result in refinement of this system, making it a solid platform for
genome annotation and evolutionary genomics.

Table 2. Structural and functional predictions for uncharacterized
proteins in COGs.

Phylogenetic pattern and COG ID*: e_gpcmy, COG0595R
Proteins in COG†: PhnP, BaC-2g-2p-5c-Bn>, YLR277C, YMR137C, YKR079C
Activity and function: Predicted Zn-dependent hydrolases
Homolog in PDB‡: Beta-lactamase (1BMC)
BeTs detected (no.): 2
Lowest P with a COG member: 0.039
Comment: Activity is not known for any protein in this ubiquitous
  COG. Biochemical and genetic data indicate that YLR277C is involved
  in messenger RNA 3'-end processing (31), whereas YMR137c is DNA
  cross-link repair protein SNM1 (39). A motif including the
  Zn-coordinating histidines of beta-lactamase is conserved.

Phylogenetic pattern and COG ID*: eh__cmy, COG0607R
Proteins in COG†: SseA, PspE, GlpE, YibN, YbbB, YnjE,
  YgaP-2h-5c-MJ0052-4y
Activity and function: Predicted sulfurtransferases
Homolog in PDB‡: Rhodanese (1RHD, 2ORA, 1ORB)
BeTs detected (no.): 2
Lowest P with a COG member: 10^-[illegible]
Comment: The sulfurtransferase activity of SseA has been demonstrated
  (40), but the rest of the proteins in this COG have no known
  activity. PspE (phage shock protein), GlpE (uncharacterized protein
  involved in glycerol metabolism), and other small proteins
  correspond to one of the two rhodanese domains.

Phylogenetic pattern and COG ID*: ehgpc_y, COG0596R
Proteins in COG†: PldB, MhpC, YcdJ,
  YnbC-HIN0065-MG020-MP132-6c-YNR064c, YKL094W
Activity and function: Predicted hydrolases and acyltransferases
Homolog in PDB‡: Lipases (2LIP, 1TAH|B, 1CVL)
BeTs detected (no.): 3
Lowest P with a COG member: 8 x 10^-5
Comment: PldB is known to possess triglyceride lipase activity (41).
  All other proteins in the COG have not been characterized but now
  can be predicted to possess the alpha/beta-hydrolase fold.

Phylogenetic pattern and COG ID*: e___cm_, COG0068C
Proteins in COG†: HypF-sll0322-MJ0713
Activity and function: Hydrogenase maturation factor
Homolog in PDB‡: Acylphosphatase (1APS)
BeTs detected (no.): 2
Lowest P with a COG member: 2 x 10^-5
Comment: HypF is required for hydrogenase biosynthesis (42), but no
  biochemical activity is known. The ~100-amino acid, NH2-terminal
  domain aligns with acylphosphatase, with the catalytic residues
  conserved, suggesting that HypF orthologs indeed possess
  acylphosphatase activity. A PSI-BLAST search with this domain as
  the query detected five additional likely acylphosphatases, namely
  E. coli YccX and M. jannaschii MJ0B09, MJ0553, MJ1331, and MJ1405
  (43).

Phylogenetic pattern and COG ID*: e___cm_, COG0663R
Proteins in COG†: CaiE, YrcA, YctoZ-sll1636, sll1031-MJ0304
Activity and function: Predicted carbonic anhydrases
Homolog in PDB‡: Carbonic anhydrase from Methanosarcina thermophila
  (1THJ)
BeTs detected (no.): 3
Lowest P with a COG member: 10^-29
Comment: The biochemical activity of the proteins in this COG is not
  known. They show not only conservation of histidine residues
  comprising the active center of this unusual carbonic anhydrase
  (44) but also significant similarity to acetyltransferases of the
  isoleucine patch superfamily (45), suggesting an unexpected
  connection between the two types of enzymes.

*The designations are as in Table 1 and Fig. 3.
†2g indicates two proteins from M. genitalium, 2p indicates two
proteins from M. pneumoniae, and so forth.
‡The PDB accession is indicated in parentheses.

REFERENCES AND NOTES

1. R. D. Fleischmann et al., Science 269, 496 (1995).

2. C. M. Fraser et al., ibid. 270, 397 (1995); R. Himmelreich et al.,
Nucleic Acids Res. 24, 4420 (1996); T. Kaneko et al., DNA Res. 3,
109 (1996); F. R. Blattner et al., Science 277, 1453 (1997).

3. C. J. Bult et al., Science 273, 1058 (1996).

4. A. Goffeau et al., ibid. 274, 546 (1996); H. W. Mewes et al.,
Nature 387, 7 (1997).

5. C. R. Woese, Curr. Biol. 6, 1060 (1996); G. J. Olsen and C. R.
Woese, Cell 89, 991 (1997); E. V. Koonin, Genome Res. 7, 418 (1997).

6. E. V. Koonin, A. R. Mushegian, K. E. Rudd, Curr. Biol. 6, 404
(1996); E. V. Koonin and A. R. Mushegian, Curr. Opin. Genet. Dev. 6,
757 (1996).

7. W. M. Fitch, Syst. Zool. 19, 99 (1970). This definition may not
embrace all of the complexity of relationships between genes in
different genomes. For example, if genes A and B are paralogs encoded
in genome 1, and A' and B' are their respective orthologs in genome
2, what is the appropriate description of the relationship between A
and B'? They formally are not paralogs, even though a generalized
definition might include such cases. Furthermore, one-to-many and
many-to-many orthologous relationships evidently exist.

8. W. M. Fitch, Philos. Trans. R. Soc. London Ser. B 349, 93 (1995).

9. R. L. Tatusov et al., Curr. Biol. 6, 279 (1996).

10. E. V. Koonin, A. R. Mushegian, M. Y. Galperin, D. R. Walker, Mol.
Microbiol. 25, 619 (1997).

11. The protein sequences were from the original references (1-4),
with modifications (for example, tentative correction of frame-shift
errors) and additions (previously unreported predicted genes) made
for E. coli (E. V. Koonin and R. L. Tatusov, unpublished
observations; K. E. Rudd, personal communication), H. influenzae (9),
M. genitalium and M. jannaschii (10), and S. cerevisiae (T. J.
Wolfsberg and D. Landsman, personal communication). The list of
systematic names for all E. coli genes was provided by K. Rudd, and
the names for all yeast genes were provided by T. Wolfsberg and D.
Landsman; the H. influenzae genes were renamed as previously
described (9); the gene names for the other species were from the
original publications. The resulting protein database from complete
genomes used in all comparisons contained 4283 sequences from E.
coli, 1703 sequences from H. influenzae, 468 sequences from M.
genitalium, 677 sequences from M. pneumoniae, 3168 sequences from
Synechocystis sp., 1736 sequences from M. jannaschii, and 5932
sequences from S. cerevisiae, totaling 17,967 sequences. This
sequence set is available on the World Wide Web at
http://www.ncbi.nlm.nih.gov/COG. All pairwise comparisons between
these sequences were performed using the BLASTPGP program, which is
based on an enhanced version of the BLAST algorithm and includes
analysis of local alignments with gaps (27). Predicted coiled-coil
regions in protein sequences were masked before the comparison using
the batch version of the COILS2 program [A. Lupas, Methods Enzymol.
266, 513 (1996); D. R. Walker and E. V. Koonin, ISMB 5, 333 (1997)],
and additionally, regions of low complexity were masked using the SEG
program with default parameters [J. C. Wootton and S. Federhen,
Methods Enzymol. 266, 554 (1996)]. Before the detection of triangles
of BeTs, paralogs were identified as those proteins from the same
lineage that showed greater similarity to each other than to any
protein from another lineage. For the purpose of triangle formation,
paralogs were treated as a group. The algorithm further included
verification that the BeTs included in a triangle formed a consistent
multiple alignment; triangles that did not contain a conserved motif
were disregarded.

12. Although the exact solution depends on the amino acid composition
and size of the particular proteins, under zero approximation, if B
(from genome b) is the BeT for A (from genome a), and C (from genome
c) is the BeT for B, the probability that C is the BeT for A by
chance is close to 1/N, where N is the number of genes in genome c,
or ~0.001.

13. C. R. Woese, Microbiol. Rev. 51, 221 (1987); R. Overbeek, G. J.
Olsen, J. Bacteriol. 176, 1 (1994); N. R. Pace, Science 276, 734
(1997). A BeT to a given clade was registered if detected in any of
the constituent species, for example, in E. coli or H. influenzae for
the Gram-negative bacteria.

14. H. Watanabe and J. Otsuka, Comput. Appl. Biosci. 11, 159 (1995);
E. V. Koonin, R. L. Tatusov, K. E. Rudd, Methods Enzymol. 266, 295
(1996).

15. A schematic visual representation of the search results was used
for this analysis [T. L. Madden, R. L. Tatusov, J. Zhang, Methods
Enzymol. 266, 131 (1996)].

16. A single-linkage clustering procedure was used with random match
probability, P < 0.001, as the cutoff (14).

17. A searchable database of COGs is available at
http://www.ncbi.nlm.nih.gov/COG. Each COG was assigned a unique
identification number, which includes a letter for the functional
category (19) and a number (see examples in Fig. 1 and Tables 1 and
2).

18. M. Lonetto, M. Gribskov, C. A. Gross, J. Bacteriol. 174, 3843
(1992).

19. The broad functional categories of proteins were as defined
previously (9), except that transcription was separated from
replication, recombination, and repair. This classification is a
modification of the system originally developed for E. coli proteins
[M. Riley, Microbiol. Rev. 57, 862 (1993)].

20. A partially similar representation of some of the protein
families from complete genomes has been recently published [R. A.
Clayton, O. White, K. A. Ketchum, J. C. Venter, Nature 387, 459
(1997)].

21. R. F. Doolittle, D.-F. Feng, S. Tsang, G. Cho, E. Little, Science
271, 470 (1996).



22. R. Himmelreich et al., Nucleic Acids Res. 25, 701 (1997).

23. A. R. Mushegian and E. V. Koonin, Proc. Natl. Acad. Sci. U.S.A.
93, 10268 (1996).

24. E. V. Koonin, A. R. Mushegian, P. Bork, Trends Genet. 12, 334
(1996).

25. J. N. Weiser, M. Shchepetov, S. T. Cheng, Infect. Immun. 65, 943
(1997).

26. J. P. Gogarten et al., Proc. Natl. Acad. Sci. U.S.A. 86, 6661
(1989); N. Iwabe et al., ibid., p. 9355; J. P. Gogarten, E. Hilario,
L. Olendzenski, in Evolution of Microbial Life, D. McL. Roberts, P.
Sharp, G. Alderson, M. Collins, Eds. (Cambridge Univ. Press,
Cambridge, 1996), pp. 267-292.

27. S. F. Altschul et al., Nucleic Acids Res. 25, 3389 (1997). The
probability of a random match, P < 0.001, was used in all PSI-BLAST
searches.

28. J. E. Walker, M. Saraste, M. J. Runswick, N. J. Gay, EMBO J. 1,
945 (1982); A. E. Gorbalenya and E. V. Koonin, Nucleic Acids Res. 17,
8413 (1989); M. Saraste, P. R. Sibbald, A. Wittinghofer, Trends
Biochem. Sci. 15, 430 (1990).

29. Protein sequences can be submitted for searching against COGs at
http://www.ncbi.nlm.nih.gov/COG/cognitor.html

30. F. C. Bernstein et al., J. Mol. Biol. 112, 535 (1977).

31. G. Chanfreau, S. M. Noble, C. Guthrie, Science 274, 1511 (1996);
A. Jenny, L. Minvielle-Sebastia, P. J. Preker, W. Keller, ibid., p.
1514; G. Stumpf and H. Domdey, ibid., p. 1517.

32. J.-F. Tomb et al., Nature 388, 539 (1997).

33. E. V. Koonin, R. L. Tatusov, M. Y. Galperin, M. N. Rozanov,
unpublished observations.

34. P. Green et al., Science 259, 1711 (1993).

35. J. Neuhard and R. A. Kelln, in Escherichia coli and Salmonella:
Cellular and Molecular Biology, F. C. Neidhardt et al., Eds.
(American Society for Microbiology, Washington, DC, ed. 2, 1996), pp.
580-599.

36. E. C. C. Lin, ibid., pp. 307-342.

37. T. W. Morris, K. E. Reed, J. E. Cronan Jr., J. Bacteriol. 177, 1
(1995).

38. P. Bork, N. P. Brown, H. Hegyi, J. Schultz, Protein Sci. 5, 1421
(1996).

39. D. Richter, E. Niegemann, M. Brendel, Mol. Gen. Genet. 231, 194
(1992); R. Wolter, W. Siede, M. Brendel, ibid. 250, 162 (1996).

40. H. Hama, T. Kayahara, W. Ogawa, M. Tsuda, T. Tsuchiya, J.
Biochem. 115, 1135 (1994).

41. T. Kobayashi et al., ibid. 98, 101 (1985).

42. A. Colbeau et al., Mol. Microbiol. 8, 15 (1993).

43. M. N. Rozanov and E. V. Koonin, unpublished observations.

44. B. E. Alber and J. G. Ferry, Proc. Natl. Acad. Sci. U.S.A. 91,
6909 (1994); C. Kisker et al., EMBO J. 15, 2323 (1996).

45. E. V. Koonin, Protein Sci. 4, 1606 (1995); M. N. Rozanov and E.
V. Koonin, unpublished observations.

46. We thank A. Schaffer for modifying the PSI-BLAST program; R.
Walker, H. Watanabe, and M. Rozanov for valuable help with data
analysis; K. Rudd, T. Wolfsberg, and D. Landsman for unpublished
data; and P. Bork, M. Galperin, M. Gelfand, A. Mushegian, P. Pevzner,
M. Roytberg, M. Rozanov, and R. Walker for helpful discussions.







bioinformatics 



Douglas R. Smith. 



Microbial pathogen genomes - 
new strategies for identifying 
therapeutics and vaccine targets 



Advances in high-throughput DNA-sequencing techniques have given us the
unprecedented ability to rapidly determine the nucleotide sequences of entire
bacterial genomes. The application of these methods to the genomes of microbial
pathogens, combined with efficient analytical tools and genome-scale approaches
for studying gene expression, is revolutionizing our approach to the selection of
targets for drug screening and vaccine development. This is bringing new life to this
important, but long-neglected, field of research.



The decision, several years ago, by the US Department
of Energy, the National Institutes of Health (NIH) and
several international funding agencies to embark upon
programs to map and sequence the human genome
has led to a number of important technological
advances that are beginning to have an impact in other
areas of biology. Among these advances are the
development of automated methods for the generation of
large amounts of raw DNA-sequencing information,
computer software for rapidly processing and analyzing
primary sequence data, and techniques for the
rapid assembly of shotgun sequencing reads, even from
entire bacterial genomes. Efficient algorithms for
similarity searching allow the rapid identification of
protein-encoding sequences that are homologous to other
genes, the sequences of which are held in public and
private databases; as of April 1996, approximately
500 megabases (Mb) of nucleotide sequence were
contained in GenBank, and approximately 200 000
sequences were held in the SWISS-PROT/Genpept/
PIR database of non-redundant proteins. Combined
with the wealth of biochemical information that
is archived in public databases, it has become possible
to describe rapidly the full repertoire of genes in a
microbial genome, and to predict many of the metabolic
pathways that an organism may utilize.

Progress in this field has been stimulated by the interests
of the biotechnology and pharmaceutical industries
in using genome-sequencing data as a basis for
drug discovery. In turn, this has led to the development
of proprietary databases containing genomic
information, which provide the basis for in silico
experiments to identify novel targets for drugs, and for
laboratory experiments to identify genes that perform
critical functions. This article summarizes some recent
developments in this important area, focusing on
bacterial sequences, and provides examples to illustrate
how genome-sequencing information from microbial
pathogens can be used to select targets for vaccine and
drug development. The overall process used to proceed
from sequence generation to target validation is
illustrated in Fig. 1.

D. R. Smith (smiih@aiL.com) is at Genome Therapeutics Corporation,
100 Beaver Street, Waltham, MA 02154, USA.

Large-scale sequencing of bacterial genomes 

Many laboratories use automated sample-preparation
techniques and fluorescence-based gel readers
[such as that produced by Applied Biosystems Inc.
(ABI); Foster City, CA, USA] for the large-scale
sequencing of bacterial genomes. These instruments
have the advantage that they are efficient, and relatively
easy to set up and operate. A few laboratories use
computer-assisted multiplex sequencing to achieve the
same end (Ref. 1). In multiplex sequencing, samples
consisting of pools of up to 20 plasmids are processed through
sample preparation and gel electrophoresis, and the
resulting sequences are determined from electroblots
of the gels by hybridization with radioactive or
fluorescently labeled probes. This technique can be used to
generate 40 films (or digitized images) from each
sequencing gel. Although multiplex sequencing is
efficient at producing large amounts of 'shotgun' data, it
is more difficult to set up and operate in the laboratory
than is fluorescence-based gel sequencing, and it
is not suited to directed-finishing strategies. ABI
machines are used in the author's laboratory to generate
primer-directed reads for finishing and gap closure.

During the past year, a group at The Institute for
Genomic Research (TIGR; Gaithersburg, MD, USA)
reported the complete sequences of Haemophilus



TIBTECH AUGUST 1996 (VOL 14)



Copyright © 1996, Elsevier Science Ltd. All rights reserved. 0167 -



Exhibit 






influenzae (1.8 Mb), a major cause of respiratory infections
and meningitis, especially in children (Ref. 2), and of
Mycoplasma genitalium (0.6 Mb), which causes
urethritis (Ref. 3). Approximately 1.6 Mb of contiguous
sequence from the 4.7 Mb Escherichia coli genome has
been published (Ref. 4), and the sequencing of a further 2 Mb
was reported at the 1995 Genome Sequencing and
Analysis VII (GSA-VII) meeting (Ref. 5). The genome of
Helicobacter pylori (1.7 Mb), the major cause of
stomach ulcers, has been sequenced by Genome
Therapeutics Corporation (GTC; Waltham, MA,
USA) under a privately funded microbial-pathogen
sequencing program. More than half (1.5 Mb) of the
2.8 Mb genome of Mycobacterium leprae (the etiologic
agent of leprosy) has also been sequenced by GTC,
and is available through GenBank, the GTC web
site <http://www.cric.com>, and through MycDB
<http://www.biochem.kth.se/MycDB.html>, which
contains mycobacterial genome mapping and sequence
information (Ref. 6).

Other microbial pathogens that are currently being
sequenced include Neisseria gonorrhoeae (University of
Oklahoma, Norman, OK, USA), Streptococcus pyogenes
(University of Oklahoma), Treponema pallidum
(University of Texas, Houston, TX, USA, and TIGR),
Mycobacterium tuberculosis (GTC and the Sanger
Centre, Hinxton, Cambridge, UK), and Staphylococcus
aureus [GTC, and Human Genome Sciences (HGS;
Rockville, MD, USA)].

In addition to these pathogens, the genomes of several
archaebacteria and other non-pathogens are being
sequenced. These include Methanococcus jannaschii
(TIGR), Pyrococcus furiosus (University of Utah, Salt
Lake City, UT, USA), Sulfolobus solfataricus (Dalhousie
University, Halifax, Nova Scotia, Canada), and
Pyrobaculum aerophilum (California Institute of
Technology, Pasadena, CA, USA, and University of
California, Los Angeles, CA, USA). The 1.7 Mb
genome of the archaeon Methanobacterium
thermoautotrophicum is near completion at GTC (Ref. 7).
Approximately 2 Mb of the 4.1 Mb Bacillus subtilis
genome has now been sequenced by a consortium of
European and Japanese laboratories, and the project
may be completed by the end of 1996 (Ref. 8).
Approximately 1 Mb of genomic sequence from the
2.7 Mb genome of the cyanobacterium Synechocystis
sp. 6803 was recently published (Ref. 9).

Within the next couple of years, therefore, we can
expect an explosion of bacterial-genome sequence
information from species representing a variety of
phylogenetic lineages, including many pathogens.

Pharmaceutical companies have shown considerable
interest in using pathogen genomics to facilitate the
development of vaccines and small-molecule therapeutics.
For example, researchers at GlaxoWellcome
have sequenced a substantial fraction of the H. pylori
genome to assist in the process of drug discovery. Over
the past year, GTC has formed two research alliances
with pharmaceutical companies to take advantage of
sequences from microbial pathogens: one with



Figure 1

[Flow diagram. Box labels: DNA sequence; Gene identification; Similarity search; Genes; Genes with homologs; Select targets; Gene disruption; IVET; Genetic fingerprinting; Expression analysis; Comparative analysis; Metabolic databases; Motif analysis; Assign function; Molecular modeling; Biochemical assays; Drug sensitivity assays; Lead compounds.]

Flow diagram illustrating the process by which a microbial genome sequence is
analysed and the information is used to direct experiments and aid in target
selection for therapeutics development. The individual steps are referred to
throughout the text. In the case of vaccine candidates, gene products from
selected targets are expressed and tested in animal models.

[...] antibiotics and vaccines to treat H. pylori infection, and
one with Schering-Plough (Union, NJ, USA), to
develop broad-spectrum antibiotics and vaccines.
Although the genomic route to drug discovery for
bacterial pathogens is new and remains unproved, the
basic paradigm (outlined below) of gene identification,
followed by functional analysis and drug screening, is
well established. Thus, it is likely that more companies
will become involved, and that in the future,
additional research alliances between genomics
companies and the pharmaceutical industry will
materialize in this area.

From sequence to genes 

The first task, when confronted with an entire
bacterial-genome sequence, is to identify all the genes.
This can be accomplished using a variety of techniques,
but the most successful approaches use a combination
of reading-frame and codon-usage analysis,
together with similarity searching, to identify putative
genes with homology to previously described sequences.
Commonly used tools include GeneMark (Ref. 10),
GenomeBrowser (Ref. 11), BLAST (Ref. 12), and highly
parallelized implementations of the Smith-Waterman



[Cartoon caption: "Well, according to the algorithms, it folds like this!"]



alignment, such as BLAZE or MPsrch (Ref. 13). In
general, organism-specific codon usage is highly
predictive for bacterial genes, but its effective use depends
on the existence of sufficient information to generate
accurate codon-usage matrices. In some cases, subsets
of genes within an organism will exhibit codon-usage
patterns that deviate significantly from the norm (Ref. 14).
Such genes are thought to represent evolutionarily
recent acquisitions by phage transduction, conjugation,
or some other form of horizontal transfer from
other organisms. If enough of these genes are present,
codon-usage tables of genomic subsets can be
constructed to identify them. Translational start sites can
be identified by the occurrence of start codons that
coincide with abrupt changes in codon usage, the
initiation of homology to previously characterized
genes, or the presence of Shine-Dalgarno sequences (Ref. 15).
Automated analysis tools (such as GenomeBrowser, Ref. 11)
that provide a graphical display of open reading frames
(ORFs), codon usage, database homologies and other
features make the task of identifying bacterial genes
and their relationships with each other in the genome
relatively straightforward. With the increasing pace of
bacterial-genome sequencing, there is an emerging
need for second-generation tools that will automate
most of the laborious annotation process.
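
The combination of reading-frame and codon-usage analysis described above can be illustrated with a minimal Python sketch. It is not the method used by GeneMark or GenomeBrowser; the function names, the set of start codons, the minimum ORF length and the uniform 1/64 background are all assumptions chosen for illustration.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STARTS = {"ATG", "GTG", "TTG"}   # assumed set of bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end, orf_seq) for ORFs in all six frames.

    Coordinates are 0-based on the scanned strand. Each ORF runs from the
    first start codon after the previous stop up to and including the stop.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif codon in STOPS:
                    if start is not None and (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None


def codon_frequencies(genes):
    """Relative codon frequencies from a list of in-frame coding sequences."""
    counts = Counter()
    for g in genes:
        counts.update(g[i:i + 3] for i in range(0, len(g) - 2, 3))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def codon_usage_score(orf, freqs, pseudo=1e-3):
    """Mean log-odds per codon against a uniform background of 1/64.

    Codons absent from the training set get the small frequency `pseudo`.
    """
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    return sum(math.log(freqs.get(c, pseudo) * 64) for c in codons) / len(codons)
```

In use, one would train `codon_frequencies` on genes already identified by similarity searching and then rank the remaining candidates from `find_orfs` by `codon_usage_score`; candidates with low scores are either non-coding or, as noted above, possible horizontal acquisitions that need their own codon-usage table.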

From genes to function 

The second phase in the analysis of bacterial
genomes is to identify the function of as many genes
as possible. Currently, sequence homology is the most
powerful tool. A high degree of homology between
the putative translation product of a newly identified
gene and an enzyme whose function has been
thoroughly studied in other organisms provides strong





support for the function of that protein, especially if it
is the only homolog in the genome under scrutiny.
Other useful tools include programs that identify
sequence motifs from databases such as PROSITE
(Ref. 16), BLOCKS (Ref. 17), BEAUTY (Ref. 18)
and ProDom (Ref. 19). If one is attempting to
identify vaccine candidates, then examining highly
expressed cell-surface proteins is relevant, so it is then
useful to know whether a protein contains a secretion
signal, even if nothing else is known about it. Although
the tools described here are very good at identifying
homologies, 25-40% of the genes in a bacterial
genome typically fail to show significant similarity
with known proteins.
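
Motif identification of the kind offered by PROSITE can be sketched as follows. The snippet translates a PROSITE-style pattern into a Python regular expression and scans a protein sequence with it. The helper names and the example protein are hypothetical, and only the common pattern elements are handled (residue sets, exclusions, `x` wildcards, repeat counts and terminal anchors), so this is a simplified illustration rather than a full implementation of the PROSITE syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def scan_motif(protein, pattern):
    """Return (start, matched_text) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein)]


# N-glycosylation site pattern (PROSITE PS00001) on a made-up sequence.
hits = scan_motif("MKNGSAANPSW", "N-{P}-[ST]-{P}.")
```

The lookahead group is used so that overlapping occurrences are reported; short patterns such as this one match frequently by chance, which is why motif hits are usually treated as supporting evidence rather than as a functional assignment.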

Once the set of similarity-searching tools has been
exhausted, one must return to molecular biology to
further elucidate the function and expression pattern
of predicted genes. Commonly used approaches to
identifying essential genes in an organism include the
use of gene knockouts, disruptions using
transposon-mediated mutagenesis, or homologous recombination
with disrupted gene constructs that contain an
antibiotic-resistance cassette. Gene disruptions can be
generated in a variety of ways, including sophisticated
'hit-and-run' approaches that interrupt a gene
without introducing polar effects into downstream ORFs
(Ref. 20). However, a gene-by-gene approach to the
study of a whole genome is certainly time consuming
and labor intensive.

The availability of large amounts of genome-sequence
information has stimulated the development
of new approaches to functional analysis on a genomic
scale. This has been particularly true for researchers
investigating yeast, where a concerted effort is being
made to ascertain the function of every ORF in the
genome. Such strategies include the conceptually
simple, but technologically advanced, technique of
making microarrays of polymerase chain reaction
(PCR)-amplified gene sequences on glass slides to
allow the fluorescence-based detection of quantitative
hybridization signals from labeled cDNA probes on
large numbers of genes simultaneously, perhaps even
all the genes of an organism (Ref. 21). An ingenious
PCR-based approach to efficient sequence-signature-based
expression analysis has recently been demonstrated (Ref. 22).
For example, a technique termed 'genetic
fingerprinting' promises to replace individual gene
knockouts by a global transposon-mutagenesis approach (Ref. 23):
insertions are induced en masse in a strain of interest,
the strain is grown under a variety of conditions,
and PCR products are analysed to identify genes in
which transposon hops are under-represented because
the genes are required for growth (Ref. 23). A conceptually
similar dropout technique, which uses tagged
transposons to identify the Salmonella typhimurium genes
required for virulence in a mouse model, has been
described (Ref. 24).
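
The logic of identifying genes in which transposon insertions are under-represented after growth can be sketched as a simple depletion calculation. This is a simplification: the published method reads PCR product patterns, whereas the sketch assumes that a per-gene insertion signal has already been quantified before and after selection. The function name, threshold, pseudocount and gene names are hypothetical.

```python
import math


def depleted_genes(before, after, min_log2_drop=2.0, pseudo=1.0):
    """Return {gene: log2 ratio} for genes under-represented after growth.

    `before` and `after` map gene names to insertion signals measured in the
    starting pool and after selection. Signals are normalised to the pool
    total, and a pseudocount avoids taking the logarithm of zero.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    flagged = {}
    for gene, n_before in before.items():
        f_before = (n_before + pseudo) / total_before
        f_after = (after.get(gene, 0) + pseudo) / total_after
        ratio = math.log2(f_after / f_before)
        if ratio <= -min_log2_drop:
            flagged[gene] = ratio
    return flagged


# Made-up signals: insertions in geneC are nearly lost after growth.
pool_before = {"geneA": 100, "geneB": 100, "geneC": 100}
pool_after = {"geneA": 150, "geneB": 145, "geneC": 5}
candidates = depleted_genes(pool_before, pool_after)
```

Repeating the comparison across several growth conditions would separate genes required under all conditions from those required only in a particular environment.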

Techniques that probe subsets of genes for a specific 
functionality, such as secretion or induction during 
growth in the host, have also been described. These 
techniques provide clones from which signature 






sequences can be derived, so that corresponding
genes can be identified by comparing them with
the genomic sequence. The IVET (in vivo expression
technology) technique, which detects gene fusions
that result in the in vivo selectable expression of a
defective purA gene or antibiotic-resistance marker,
has been used to identify Salmonella genes, the
expression of which is induced when the pathogen is grown
in mice (Ref. 25). Finally, protein microsequencing (Ref. 26) and
mass-spectrometry-based peptide analysis (Ref. 27) have
been used to identify protein components (e.g.
outer-membrane proteins) in partially purified mixtures,
or to identify specific proteins separated by
two-dimensional gel electrophoresis. Sequences generated
in this manner can be used to correlate specific
proteins with the gene sequences from which they are
expressed.

Target selection and validation

The techniques described in the previous section
can be used to identify genes in specific functional
categories that may represent good targets for drug or
vaccine development. In general, when developing
new antibiotics, one is interested in genes that are
essential under all growth conditions (and preferably
even in quiescent cells), and for which inhibitors with
useful chemical properties, such as permeability and
low toxicity, can be identified. One advantage of
having the entire sequence of a genome is that targets
can be prioritized in terms of their activities and the
properties of compounds that are known to interact
with them. Even with the results of knockout or in
vivo expression experiments, additional biological
information can aid in narrowing down the field of
choices. For example, genes can be selected on the
basis of their probable roles in intracellular
metabolism. Databases, such as EcoCyc (Ref. 28) or PUMA
(Ref. 29), that describe known metabolic pathways
can be helpful in this regard. Detailed structural
information about homologs of identified genes
(determined using the Protein DataBank, Ref. 30) can be used to
assist in the molecular modeling of inhibitors (some
resources for molecular modeling can be found at
Ref. 31).

As more genomes are sequenced, it will become
possible to identify genes that are unique to a
particular organism or group of organisms, or genes
that are conserved in certain groups. Thus, for
example, it will be possible to use electronic
comparison to identify genes that are present in H. pylori
but not in other gut-dwelling bacteria such as E. coli,
providing a basis for the development of antibiotics
specific to H. pylori. Although combinatorial
chemistries promise to speed up our ability to synthesize
and screen large numbers of unique chemical entities,
the sequence-based approach described here provides
an avenue for the rational identification and selection
of key targets for therapeutics development.
Ultimate validation of the targets will, of course, require
additional experiments such as protein expression,
biochemical-assay development and animal
studies to identify those with the most useful
properties or inhibitors.
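
The electronic comparison described above, finding genes present in one genome but without counterparts in another, reduces to a filter over similarity-search results. The sketch below assumes the hits have already been parsed into (query gene, subject genome, E-value) tuples; the function name, the E-value cut-off and the gene identifiers are hypothetical.

```python
def genome_specific_genes(genes, hits, other_genomes, max_evalue=1e-5):
    """Return genes with no significant hit in any of `other_genomes`.

    `hits` is an iterable of (query_gene, subject_genome, evalue) tuples,
    for example parsed from the tabular output of a similarity search.
    """
    conserved = {
        query
        for query, genome, evalue in hits
        if genome in other_genomes and evalue <= max_evalue
    }
    return [g for g in genes if g not in conserved]


# Made-up identifiers and scores for illustration only.
pathogen_genes = ["hp0001", "hp0002", "hp0003"]
search_hits = [
    ("hp0001", "E. coli", 1e-40),      # clear homolog in E. coli
    ("hp0002", "E. coli", 0.5),        # not significant
    ("hp0003", "B. subtilis", 1e-30),  # homolog, but in a different genome
]
specific = genome_specific_genes(pathogen_genes, search_hits, {"E. coli"})
```

The same filter run with the condition inverted gives the conserved set, which is the relevant list when the aim is a broad-spectrum rather than a species-specific agent.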

Acknowledgements 

The sequencing of Mycobacterium leprae and
M. tuberculosis, and technology development for
multiplex sequencing, is supported by a NIH Genome
Science and Technology Center grant 1P01-HG1106-01
from the National Center for Human Genome
Research. The sequencing of Methanobacterium
thermoautotrophicum is supported under the Microbial
Genome Program by Grant No. DE-FC02-95ER61967
from the Office of Health and Environmental Research
of the US Department of Energy. The sequencing of
Helicobacter pylori and Staphylococcus aureus is supported
by Genome Therapeutics Corporation. Thanks to
Brad Guild for comments on the manuscript.

References 

1 Church, G. M. and Kieffer-Higgins, S. (1988) Science 240, 185-188

2 Fleischmann, R. D. et al. (1995) Science 269, 496-512

3 Fraser, C. M. et al. (1995) Science 270, 397-403

4 Burland, V., Plunkett, G., III, Sofia, H. J., Daniels, D. L. and Blattner, F. R. (1995) Nucleic Acids Res. 23, 2105-2119

5 Burland, V., Plunkett, G., III and Blattner, F. R. (1995) in Genome Science and Technology 1, P-16, Mary Ann Liebert

6 Bergh, S. and Cole, S. T. (1994) Mol. Microbiol. 12, 517-534

7 Smith, D. R. et al. (1995) in Genome Science and Technology 1, P-46, Mary Ann Liebert

8 Devine, K. (1995) Trends Biotechnol. 13, 210-216

9 Kaneko, T. et al. (1995) DNA Res. 2, 153-166

10 Borodovsky, M. and McIninch, J. (1993) Comput. Chem. 17, 123-133

11 Robison, K. R. and Church, G. M. (1995) <http://www.belmont.com/gb.html>

12 Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410

13 MPsrch <http://www.ebi.ac.uk/searches/blitz.html>

14 Medigue, C., Rouxel, T., Vigier, P., Henaut, A. and Danchin, A. (1991) J. Mol. Biol. 222, 851-856

15 Shine, J. and Dalgarno, L. (1975) Eur. J. Biochem. 57, 221-230

16 Bairoch, A. (1991) Nucleic Acids Res. 19, 2241-2245

17 Henikoff, S. and Henikoff, J. G. (1991) Nucleic Acids Res. 19, 6565-6572

18 Worley, K. C., Wiese, B. A. and Smith, R. F. (1995) Genome Res. 5, 173-184

19 Sonnhammer, E. L. and Kahn, D. (1994) Protein Sci. 3, 482-492

20 Link, A. J. and Church, G. M. <http://twod.med.harvard.edu/labgc/pKO3.html>

21 Schena, M., Shalon, D., Davis, R. W. and Brown, P. O. (1995) Science 270, 467-470

22 Velculescu, V. E., Zhang, L., Vogelstein, B. and Kinzler, K. W. (1995) Science 270, 484-487

23 Smith, V., Botstein, D. and Brown, P. O. (1995) Proc. Natl. Acad. Sci. USA 92, 6479-6483

24 Hensel, M. et al. (1995) Science 269, 400-403

25 Mahan, M. J. et al. (1995) Proc. Natl. Acad. Sci. USA 92, 669-673

26 Tempst, P., Link, A. J., Riviere, L. R., Fleming, M. and Elicone, C. (1990) Electrophoresis 11, 537-553

27 James, P., Quadroni, M., Carafoli, E. and Gonnet, G. (1993) Biochem. Biophys. Res. Commun. 195, 58-64

28 Karp, P. D. (1992) CABIOS 8, 347-357

29 Gaasterland, T., Maltsev, N., Overbeek, R. and Selkov, E. <http://www.mcs.anl.gov/home/compbio/PUMA>

30 Protein DataBank <http://www.pdb.bnl.gov>

31 <http://www.pharmacy.wisc.edu>






